Searching the Web - PowerPoint PPT Presentation

Provided by: Jungh1
Learn more at: http://oak.cs.ucla.edu
Transcript and Presenter's Notes

1
Searching the Web
Junghoo Cho, UCLA Computer Science
2
Information Galore
Biblio server
Legacy database
Plain text files
3
Information Overload Problem
4
Solution
  • Indexing approach
  • Google, Excite, AltaVista
  • Integration approach
  • MySimon, BizRate

5
Indexing Approach
Central Index
6
Challenges
  • Page selection and download
  • What page to download?
  • Page and index update
  • How to update pages?
  • Page ranking
  • What page is important or relevant?
  • Scalability

7
Integration Approach
Mediator
Wrapper
Wrapper
Wrapper
Source 1
Source 2
Source n
8
Challenges
  • Heterogeneous sources
  • Different data models: relational,
    object-oriented
  • Different schemas and representations
  • 'Keanu Reeves' or 'Reeves, K.' etc.
  • Limited query capabilities
  • Mediator caching
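The mediator and wrapper arrangement, together with the 'Keanu Reeves' versus 'Reeves, K.' mismatch, can be sketched roughly as follows. This is only an illustration: the `Person` schema, the two wrapper functions, and the sample records other than the two Reeves spellings are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """Hypothetical common schema: lowercase last name plus first initial."""
    last: str
    initial: str


def wrap_source_1(record: str) -> Person:
    """Hypothetical wrapper for a source that stores names as 'First Last'."""
    first, last = record.rsplit(" ", 1)
    return Person(last=last.lower(), initial=first[0].lower())


def wrap_source_2(record: str) -> Person:
    """Hypothetical wrapper for a source that stores names as 'Last, F.'."""
    last, first = [part.strip() for part in record.split(",", 1)]
    return Person(last=last.lower(), initial=first[0].lower())


def mediator_lookup(query: Person, sources) -> list:
    """Return the names of all sources holding a record that matches the query.

    Each entry of sources is a (source name, wrapper, records) triple; the
    wrapper translates that source's representation into the common schema.
    """
    hits = []
    for source_name, wrapper, records in sources:
        if any(wrapper(record) == query for record in records):
            hits.append(source_name)
    return hits


SOURCES = [
    ("source 1", wrap_source_1, ["Keanu Reeves", "Jane Doe"]),
    ("source 2", wrap_source_2, ["Reeves, K.", "Doe, J."]),
]
matches = mediator_lookup(Person(last="reeves", initial="k"), SOURCES)
```

Reducing every name to last name plus initial is a deliberately lossy choice here, since 'Reeves, K.' carries no more information than that.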

9
Focus of the Talk
  • Indexing approach
  • How to keep pages up-to-date?

10
Outline of This Talk
  • How can we keep pages fresh?
  • How does the Web change?
  • What do we mean by fresh pages?
  • How should we refresh pages?

11
Web Evolution Experiment
  • How often does a Web page change?
  • How long does a page stay on the Web?
  • How long does it take for 50% of the Web to
    change?
  • How do we model Web changes?

12
Experimental Setup
  • February 17 to June 24, 1999
  • 270 sites visited (with permission)
  • identified 400 sites with highest PageRank
  • contacted administrators
  • 720,000 pages collected
  • 3,000 pages from each site daily
  • start at root, visit breadth first (get new and old
    pages)
  • ran only 9pm - 6am, 10 seconds between site
    requests
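The breadth-first visit with a per-site page cap might look roughly like the sketch below. It is not the crawler used in the experiment: `get_links` is a hypothetical caller-supplied function, the toy `SITE` graph stands in for real pages, and the fetching, the 10-second delay, and the nightly time window are all left out.

```python
from collections import deque


def breadth_first_crawl(root, get_links, max_pages=3000):
    """Return up to max_pages URLs in breadth-first order, starting at root.

    get_links is a hypothetical function mapping a URL to the URLs it links
    to. A real crawler would download the page at this point and wait
    between requests.
    """
    seen = {root}
    queue = deque([root])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph used only for illustration.
SITE = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c", "/"], "/c": []}
first_three = breadth_first_crawl("/", lambda url: SITE.get(url, []), max_pages=3)
whole_site = breadth_first_crawl("/", lambda url: SITE.get(url, []))
```

Restarting from the root on every run means pages that were added near the top of the site are picked up alongside previously seen ones.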

13
Average Change Interval
[Figure: fraction of pages (y-axis) vs. average change interval (x-axis)]
14
Change Interval By Domain
[Figure: fraction of pages (y-axis) vs. average change interval (x-axis), by domain]
15
Modeling Web Evolution
  • Poisson process with rate $\lambda$
  • $T$ is time to next event
  • $f_T(t) = \lambda e^{-\lambda t}$ $(t > 0)$
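As a small numeric illustration of the Poisson model, the sketch below evaluates the exponential density of the time to the next change and the probability that a change occurs within a given time. The function names are made up for this example, and the rate of one change per 10 days is borrowed from the talk's own example.

```python
import math


def interval_density(rate, t):
    """Density of the time to the next change: rate * exp(-rate * t) for t > 0."""
    return rate * math.exp(-rate * t) if t > 0 else 0.0


def change_probability(rate, t):
    """Probability that the next change occurs within time t: 1 - exp(-rate * t)."""
    return 1.0 - math.exp(-rate * t)


# A page that changes every 10 days on average has rate 1/10 per day.
rate = 1 / 10
p_within_10_days = change_probability(rate, 10)
density_at_10_days = interval_density(rate, 10)
```

Under this model, short intervals between changes are the most likely ones, and the density decays exponentially as the interval grows.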

16
Change Interval of Pages
[Figure: for pages that change every 10 days on average, fraction of changes with given interval (y-axis) vs. interval in days (x-axis), compared with the Poisson model]
17
Change Metrics
  • Freshness
  • Freshness of element $e_i$ at time $t$ is
    $F(e_i; t) = 1$ if $e_i$ is up-to-date at time $t$,
    and $0$ otherwise

18
Change Metrics
  • Age
  • Age of element $e_i$ at time $t$ is
    $A(e_i; t) = 0$ if $e_i$ is up-to-date at time $t$,
    and $t - (\text{modification time of } e_i)$
    otherwise
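The two metrics, freshness and age, can be written down directly. The sketch below assumes that "modification time" means the earliest change on the source that the local copy has not yet picked up; the parameter name `first_unseen_modification` is hypothetical, and `None` stands for an up-to-date copy.

```python
def freshness(first_unseen_modification):
    """Freshness: 1 if the copy is up-to-date, 0 otherwise.

    first_unseen_modification is the time of the earliest source change not
    yet reflected in the copy, or None if the copy is up-to-date (an
    assumption of this sketch).
    """
    return 1 if first_unseen_modification is None else 0


def age(t, first_unseen_modification):
    """Age at time t: 0 if up-to-date, otherwise time since the unseen change."""
    if first_unseen_modification is None:
        return 0
    return t - first_unseen_modification


# A copy whose source changed at time 4 and has not been refreshed by time 10.
stale_freshness = freshness(4.0)
stale_age = age(10.0, 4.0)
```

Freshness only records whether the copy is stale, while age also records for how long it has been stale.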

19
Change Metrics
[Figure: F(e_i) (between 0 and 1) and A(e_i) (starting at 0) plotted over time, with refresh and update events marked]
20
Trick Question
  • Two page database
  • e1 changes daily
  • e2 changes once a week
  • Can visit one page per week
  • How should we visit pages?
  • e1 e2 e1 e2 e1 e2 e1 e2 ... (uniform)
  • e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... (proportional)
  • e1 e1 e1 e1 e1 e1 ...
  • e2 e2 e2 e2 e2 e2 ...
  • ?

[Figure: pages e1 and e2 on the web and their copies in the database]
21
Proportional Often Not Good!
  • Visit fast changing e1
  • → get 1/2 day of freshness
  • Visit slow changing e2
  • → get 1/2 week of freshness
  • Visiting e2 is a better deal!
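The comparison can also be checked numerically for the two-page example. The sketch below is not the slide's own back-of-the-envelope argument: it assumes Poisson changes and evenly spaced refreshes, in which case a copy is still fresh a time s after a refresh with probability exp(-λs), and averaging over one refresh interval gives (1 - exp(-x)) / x with x = λ / (refresh rate). The policy names and the way the weekly visit is split into shares are choices made for this illustration.

```python
import math


def expected_freshness(change_rate, refresh_rate):
    """Time-averaged freshness of one page under Poisson changes.

    Assumes evenly spaced refreshes; a page that is never refreshed is
    treated as having zero long-run freshness.
    """
    if refresh_rate <= 0:
        return 0.0
    x = change_rate / refresh_rate
    return (1.0 - math.exp(-x)) / x


CHANGE_RATES = {"e1": 1.0, "e2": 1.0 / 7.0}  # changes per day
BUDGET = 1.0 / 7.0                           # one visit per week, in visits per day

# Share of the visit budget given to each page under each policy.
POLICIES = {
    "uniform": {"e1": 0.5, "e2": 0.5},
    "proportional": {"e1": 7 / 8, "e2": 1 / 8},
    "only e1": {"e1": 1.0, "e2": 0.0},
    "only e2": {"e1": 0.0, "e2": 1.0},
}


def average_freshness(shares):
    """Average the expected freshness over both pages for one policy."""
    values = [
        expected_freshness(CHANGE_RATES[page], shares[page] * BUDGET)
        for page in CHANGE_RATES
    ]
    return sum(values) / len(values)


scores = {name: average_freshness(shares) for name, shares in POLICIES.items()}
best_policy = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name:12s} {score:.3f}")
```

Under these assumptions, spending the whole budget on the slowly changing page scores highest and the proportional policy scores below the uniform one, which agrees with the slide's conclusion.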

22
Optimal Refresh Frequency
  • Problem
  • Given the page change rates and f,
  • find the refresh frequency of each page
  • that maximizes freshness
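One way to read the optimization problem is as a search over how a fixed refresh budget is divided among pages. The grid search below is a hedged sketch for two pages only, not the talk's solution method: it assumes Poisson changes with evenly spaced refreshes, and it reuses the rates and budget of the talk's two-page example (one page changing daily, one weekly, one visit per week).

```python
import math


def page_freshness(change_rate, refresh_rate):
    """Time-averaged freshness under Poisson changes and regular refreshes."""
    if refresh_rate <= 0:
        return 0.0  # a page that is never refreshed is treated as never fresh
    x = change_rate / refresh_rate
    return (1.0 - math.exp(-x)) / x


def best_two_page_split(rate_1, rate_2, budget, steps=100):
    """Grid-search the share of the refresh budget given to page 1.

    Returns (share for page 1, average freshness of the two pages).
    """
    best_share, best_score = 0.0, -1.0
    for i in range(steps + 1):
        share = i / steps
        score = 0.5 * (
            page_freshness(rate_1, share * budget)
            + page_freshness(rate_2, (1.0 - share) * budget)
        )
        if score > best_score:
            best_share, best_score = share, score
    return best_share, best_score


# Page 1 changes daily, page 2 weekly; the budget is one visit per week.
share_fast, score = best_two_page_split(rate_1=1.0, rate_2=1.0 / 7.0, budget=1.0 / 7.0)
```

A grid search does not scale beyond a handful of pages; it is used here only because it makes the objective easy to inspect.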

23
Optimal Refresh Frequency
  • Shape of curve is the same in all cases
  • Holds for any change frequency distribution

24
Optimal Refresh for Age
  • Shape of curve is the same in all cases
  • Holds for any change frequency distribution

25
Comparing Policies
Based on statistics from the experiment and a revisit
frequency of once a month
26
Not Every Page is Equal!
→ Some pages are more important
e1: accessed by users 10 times/day
e2: accessed by users 20 times/day
27
Weighted Freshness
[Figure: f (y-axis) vs. λ (x-axis), with curves for w = 2 and w = 1]
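A weighted variant of freshness can be sketched as a weighted average of the per-page freshness values. Treating the user access counts from the previous slide as the weights is an assumption of this example, and the function name is made up.

```python
def weighted_freshness(freshness_by_page, weight_by_page):
    """Weighted average of per-page freshness values (each 0 or 1)."""
    total_weight = sum(weight_by_page.values())
    weighted_sum = sum(
        weight_by_page[page] * freshness_by_page[page] for page in weight_by_page
    )
    return weighted_sum / total_weight


# Weights taken from the access counts: e1 10 times/day, e2 20 times/day.
WEIGHTS = {"e1": 10, "e2": 20}
only_e1_fresh = weighted_freshness({"e1": 1, "e2": 0}, WEIGHTS)
only_e2_fresh = weighted_freshness({"e1": 0, "e2": 1}, WEIGHTS)
```

With these weights, keeping only e2 fresh counts twice as much as keeping only e1 fresh.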
28
Change Frequency Estimation
  • How to estimate change frequency?
  • Naïve estimator: X/T
  • X: number of detected changes
  • T: monitoring period
  • 2 changes in 10 days: 0.2 times/day
  • Incomplete change history
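The effect of an incomplete change history can be seen in a small simulation. The sketch below assumes Poisson changes and one check per day, so several changes within a day are detected as a single change; the function name, the seed, and the true rate of one change per day are choices made for this illustration.

```python
import random


def simulate_naive_estimate(true_rate, days, seed=0):
    """Check a page once a day and return X/T for the simulated history.

    X counts the days on which at least one change happened, because a
    daily check cannot tell one change from several.
    """
    rng = random.Random(seed)
    detected = 0
    for _ in range(days):
        # The time to the first change of the day is exponential with the
        # true rate; the check only sees whether it fell within the day.
        if rng.expovariate(true_rate) < 1.0:
            detected += 1
    return detected / days


true_rate = 1.0  # changes per day
estimate = simulate_naive_estimate(true_rate, days=10_000)
```

Because at most one change per day can be detected, X/T stays well below the true rate here, which is the bias the naïve estimator suffers from.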

29
Improved Estimator
  • Based on the Poisson model
  • X: number of detected changes
  • N: number of accesses
  • f: access frequency
  • 3 changes in 10 days: 0.36 times/day
  • → Accounts for missed changes
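The estimator's formula did not survive in this transcript, so the sketch below uses an assumed form, -f · ln(1 - X/N). The reasoning: under the Poisson model, each access detects a change with probability 1 - exp(-λ/f), so X/N estimates that probability and solving for λ gives the expression. With daily accesses (N = 10, f = 1 per day) this form reproduces the slide's figure of 0.36 times/day for 3 detected changes, but it should not be read as the exact estimator from the talk.

```python
import math


def naive_estimate(changes, period):
    """Naive estimator X/T."""
    return changes / period


def improved_estimate(changes, accesses, access_frequency):
    """Assumed Poisson-based estimator: -f * ln(1 - X/N).

    Undefined when a change was detected at every access (X == N).
    """
    if changes >= accesses:
        raise ValueError("estimator is undefined when every access detects a change")
    return -access_frequency * math.log(1.0 - changes / accesses)


# 3 detected changes in 10 daily accesses.
naive = naive_estimate(3, 10)
improved = improved_estimate(3, accesses=10, access_frequency=1.0)
```

The improved value is always at least as large as the naïve one, since the logarithm compensates for changes that fell between two accesses and were counted only once.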

30
Improvement Significant?
  • Application to a Web crawler
  • Visit pages once every week for 5 weeks
  • Estimate change frequency
  • Adjust revisit frequency based on the estimate
  • Uniform: do not adjust
  • Naïve: based on the naïve estimator
  • Ours: based on our improved estimator

31
Improvement from Our Estimator
(9,200,000 visits in total)
32
Summary
  • Information overload problem
  • Indexing approach
  • Integration approach
  • Page update
  • Web evolution experiment
  • Change metric
  • Refresh policy
  • Frequency estimator

33
Research Opportunity
  • Efficient query processing?
  • Automatic source discovery?
  • Automatic data extraction?

34
Web Archive Project
  • Can we store the history of the Web?
  • Web is ephemeral
  • Study of the Evolution of the Web
  • Challenges
  • Update policy?
  • Compression?
  • New storage structure?
  • New index structure?

35
The End
  • Thank you for your attention
  • For more information visit
  • http://www.cs.ucla.edu/cho/