Title: Searching the Web
1Searching the Web
Junghoo Cho UCLA Computer Science
2Information Galore
Biblio server
Legacy database
Plain text files
3Information Overload Problem
4Solution
- Indexing approach
- Google, Excite, AltaVista
- Integration approach
- MySimon, BizRate
5Indexing Approach
Central Index
6Challenges
- Page selection and download
- What page to download?
- Page and index update
- How to update pages?
- Page ranking
- What page is important or relevant?
- Scalability
7Integration Approach
[Figure: a mediator on top of wrappers, one wrapper per source (Source 1, Source 2, ..., Source n)]
8Challenges
- Heterogeneous sources
- Different data models: relational, object-oriented
- Different schemas and representations
- "Keanu Reeves" or "Reeves, K.", etc.
- Limited query capabilities
- Mediator caching
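The mediator and wrapper roles above can be sketched in a few lines. This is only an illustration under stated assumptions: the class names `DictWrapper` and `Mediator`, the helper `normalize_name`, the field names, and the sample records are all hypothetical and do not come from the talk. Each wrapper hides one source's schema and name format; the mediator fans a query out to every wrapper and merges the answers.

```python
def normalize_name(name):
    """Rewrite 'Last, First' as 'First Last'; leave other names unchanged."""
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        return f"{first} {last}"
    return name.strip()

class DictWrapper:
    """Wraps one source whose records are dicts with source-specific field names."""

    def __init__(self, records, name_field, title_field):
        self.records = records
        self.name_field = name_field
        self.title_field = title_field

    def query(self, actor):
        # Translate each matching record into the mediator's common schema.
        return [
            {"actor": actor, "title": record[self.title_field]}
            for record in self.records
            if normalize_name(record[self.name_field]) == actor
        ]

class Mediator:
    """Sends the same query to every wrapper and concatenates the results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, actor):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(actor))
        return results

# Two illustrative sources with different schemas and name formats.
source1 = DictWrapper([{"star": "Keanu Reeves", "film": "Speed"}], "star", "film")
source2 = DictWrapper(
    [
        {"actor_name": "Reeves, Keanu", "movie": "The Matrix"},
        {"actor_name": "Moss, Carrie-Anne", "movie": "The Matrix"},
    ],
    "actor_name",
    "movie",
)
mediator = Mediator([source1, source2])
titles = [row["title"] for row in mediator.query("Keanu Reeves")]
```

An abbreviated form such as "Reeves, K." cannot be resolved by this simple rewrite; matching it would need extra knowledge, which is part of why heterogeneous representations are listed as a challenge.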
9Focus of the Talk
- Indexing approach
- How to maintain pages up-to-date?
10Outline of This Talk
- How can we maintain pages fresh?
- How does the Web change?
- What do we mean by fresh pages?
- How should we refresh pages?
11Web Evolution Experiment
- How often does a Web page change?
- How long does a page stay on the Web?
- How long does it take for 50% of the Web to change?
- How do we model Web changes?
12Experimental Setup
- February 17 to June 24, 1999
- 270 sites visited (with permission)
- identified 400 sites with highest PageRank
- contacted administrators
- 720,000 pages collected
- 3,000 pages from each site daily
- start at root, visit breadth first (get new & old pages)
- ran only 9pm - 6am, 10 seconds between site requests
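The crawl order described above (start at the root, visit breadth first, stop at a per-site page budget) can be sketched as follows. This is a minimal illustration over an in-memory link graph, not the crawler used in the experiment: the `links` mapping, the `bfs_crawl` name, and the `delay` callback are hypothetical. A real crawler would fetch pages over HTTP and could pass a 10-second sleep as `delay`.

```python
from collections import deque

def bfs_crawl(links, root, max_pages, delay=lambda: None):
    """Return pages in breadth-first order from root, at most max_pages of them.

    links: dict mapping a page to the pages it links to
           (a stand-in for fetching and parsing real pages).
    delay: called between consecutive page visits.
    """
    visited = []
    seen = {root}
    queue = deque([root])
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
        if queue and len(visited) < max_pages:
            delay()  # space out requests to the same site
    return visited

# Hypothetical site: "/" links to "/a" and "/b", and so on.
site = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c", "/d"],
}
order = bfs_crawl(site, "/", max_pages=4)
```

Because the crawl restarts from the root each day, pages newly linked near the root are picked up alongside pages seen before.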
13Average Change Interval
[Figure: fraction of pages vs. average change interval]
14Change Interval By Domain
[Figure: fraction of pages vs. average change interval, by domain]
15Modeling Web Evolution
- Poisson process with rate $\lambda$
- $T$ is time to next event
- $f_T(t) = \lambda e^{-\lambda t}$ ($t > 0$)
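Under this model the time to the next change is exponentially distributed, so the probability that a page changes within time t is 1 - e^(-λt) and the mean change interval is 1/λ. The sketch below checks both numerically; the function names and the chosen rate are illustrative assumptions.

```python
import math
import random

def prob_changed_within(rate, t):
    """P(T <= t) when T has density rate * exp(-rate * t)."""
    return 1.0 - math.exp(-rate * t)

def sample_intervals(rate, n, seed=0):
    """Draw n change intervals from the exponential distribution."""
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(n)]

rate = 1 / 10  # one change every 10 days on average
intervals = sample_intervals(rate, 100_000)
mean_interval = sum(intervals) / len(intervals)   # close to 1 / rate
p_within_10_days = prob_changed_within(rate, 10)  # equals 1 - 1/e
```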
16Change Interval of Pages
[Figure: fraction of changes with given interval vs. interval in days, for pages that change every 10 days on average, compared against the Poisson model]
17Change Metrics
- Freshness
- Freshness of element $e_i$ at time $t$ is
  $F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$
18Change Metrics
- Age
- Age of element $e_i$ at time $t$ is
  $A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$
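Both metrics can be computed from a page's change history and the time of its last refresh. The sketch below assumes that "modification time" means the first change made after the local copy was last refreshed; that reading, and the names `first_unsynced_change`, `freshness`, and `age`, are assumptions of this example.

```python
import bisect

def first_unsynced_change(change_times, sync_time, t):
    """Return the first change in (sync_time, t], or None if the copy is up to date.

    change_times must be sorted in increasing order.
    """
    i = bisect.bisect_right(change_times, sync_time)
    if i < len(change_times) and change_times[i] <= t:
        return change_times[i]
    return None

def freshness(change_times, sync_time, t):
    """1 if the local copy is up to date at time t, else 0."""
    return 1 if first_unsynced_change(change_times, sync_time, t) is None else 0

def age(change_times, sync_time, t):
    """0 if up to date; otherwise time elapsed since the first missed change."""
    changed_at = first_unsynced_change(change_times, sync_time, t)
    return 0 if changed_at is None else t - changed_at

changes = [2.0, 5.0, 9.0]   # days on which the source page changed
last_refresh = 3.0          # day on which the local copy was refreshed
```

With this history the copy is fresh on day 4 (no change since the refresh) and stale on day 7, where its age is 2 days because the missed change happened on day 5.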
19Change Metrics
[Figure: F(ei) (between 0 and 1) and A(ei) (starting at 0) plotted against time, with update and refresh events marked]
20Trick Question
- Two page database
- e1 changes daily
- e2 changes once a week
- Can visit one page per week
- How should we visit pages?
- e1 e2 e1 e2 e1 e2 e1 e2 ... (uniform)
- e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... (proportional)
- e1 e1 e1 e1 e1 e1 ...
- e2 e2 e2 e2 e2 e2 ...
- ?
[Figure: pages e1 and e2 on the web, with their copies e1 and e2 in the database]
21Proportional Often Not Good!
- Visit fast changing e1
- → get 1/2 day of freshness
- Visit slow changing e2
- → get 1/2 week of freshness
- Visiting e2 is a better deal!
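The comparison can be made concrete under the Poisson model. If a page changes at rate λ and is refreshed at regular intervals with frequency f, the copy is still fresh a time s after a refresh with probability e^(-λs); averaging over one refresh interval gives an expected freshness of (f/λ)(1 - e^(-λ/f)). The sketch below applies that formula to the two-page example. Treating "changes daily" as a Poisson rate of 7 per week, and the names used here, are assumptions of this example.

```python
import math

def avg_freshness(change_rate, refresh_rate):
    """Time-averaged freshness of one page under the Poisson model,
    refreshed at regular intervals; 0 if the page is never refreshed."""
    if refresh_rate == 0:
        return 0.0
    x = change_rate / refresh_rate
    return (1.0 - math.exp(-x)) / x

CHANGE_RATES = {"e1": 7.0, "e2": 1.0}  # changes per week

# Each policy spends a total of one visit per week.
POLICIES = {
    "uniform":      {"e1": 1 / 2, "e2": 1 / 2},
    "proportional": {"e1": 7 / 8, "e2": 1 / 8},
    "only e1":      {"e1": 1.0,   "e2": 0.0},
    "only e2":      {"e1": 0.0,   "e2": 1.0},
}

def database_freshness(policy):
    """Average freshness over both pages for a given visit allocation."""
    total = sum(avg_freshness(CHANGE_RATES[p], policy[p]) for p in CHANGE_RATES)
    return total / len(CHANGE_RATES)

scores = {name: database_freshness(policy) for name, policy in POLICIES.items()}
best = max(scores, key=scores.get)
```

Under these assumptions, visiting only e2 scores about 0.32, uniform about 0.25, proportional about 0.125, and visiting only e1 about 0.07, so the proportional policy does worse than the uniform one.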
22Optimal Refresh Frequency
- Problem
- Given $\lambda_1, \dots, \lambda_N$ and $f$,
- find $f_1, \dots, f_N$
- that maximize the average freshness of the database
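One way to write the problem out, as a sketch only: assume N pages with Poisson change rates λ_i, each refreshed at regular intervals with frequency f_i, and take f to be the average refresh frequency the crawler can afford. These symbols, the form of the constraint, and the closed-form freshness term are assumptions of this sketch rather than formulas taken from the talk.

```latex
\max_{f_1, \dots, f_N} \; \frac{1}{N} \sum_{i=1}^{N} \frac{f_i}{\lambda_i} \left( 1 - e^{-\lambda_i / f_i} \right)
\quad \text{subject to} \quad \frac{1}{N} \sum_{i=1}^{N} f_i = f, \qquad f_i \ge 0
```

Each summand is the expected freshness of one page under the Poisson model; a page with f_i = 0 is never refreshed and contributes zero.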
23Optimal Refresh Frequency
- Shape of curve is the same in all cases
- Holds for any change frequency distribution
24Optimal Refresh for Age
- Shape of curve is the same in all cases
- Holds for any change frequency distribution
25Comparing Policies
Based on statistics from the experiment and a revisit frequency of once every month
26Not Every Page is Equal!
→ Some pages are more important
e1
Accessed by users 10 times/day
e2
Accessed by users 20 times/day
27Weighted Freshness
[Figure: refresh frequency f vs. change frequency λ, with curves for weights w = 1 and w = 2]
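A simple way to account for importance is to weight each page's freshness before averaging. The sketch below uses the user access counts from the previous slide as weights; that choice, and the function name `weighted_freshness`, are assumptions of this example rather than the talk's exact definition.

```python
def weighted_freshness(freshness_by_page, weight_by_page):
    """Weighted average of per-page freshness values (each 0 or 1)."""
    total_weight = sum(weight_by_page.values())
    weighted_sum = sum(
        weight_by_page[page] * freshness_by_page[page] for page in weight_by_page
    )
    return weighted_sum / total_weight

weights = {"e1": 10, "e2": 20}   # user accesses per day
snapshot = {"e1": 1, "e2": 0}    # e1 is up to date, e2 is stale

weighted = weighted_freshness(snapshot, weights)              # 10 / 30
unweighted = weighted_freshness(snapshot, {"e1": 1, "e2": 1})  # 1 / 2
```

The same snapshot scores lower once weights are applied, because the stale page is the one users access more often.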
28Change Frequency Estimation
- How to estimate change frequency?
- Naïve Estimator: X/T
- X: number of detected changes
- T: monitoring period
- 2 changes in 10 days: 0.2 times/day
- Incomplete change history
29Improved Estimator
- Based on the Poisson model
- $\hat{\lambda} = -f \ln(1 - X/N)$
- X: number of detected changes
- N: number of accesses
- f: access frequency
- 3 changes in 10 days: 0.36 times/day
- → Accounts for missed changes
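The two estimators can be compared on the example above. Under the Poisson model, a page accessed at regular intervals of 1/f shows no change at a given access with probability e^(-λ/f); equating that with the observed fraction (N - X)/N and solving for λ gives -f ln(1 - X/N). That expression is an assumption of this sketch, chosen because it reproduces the 0.36 times/day figure; the function names are hypothetical.

```python
import math

def naive_estimate(changes_detected, monitoring_period):
    """X / T: detected changes divided by the monitoring period."""
    return changes_detected / monitoring_period

def poisson_estimate(changes_detected, accesses, access_rate):
    """-f * ln(1 - X/N): corrects for changes missed between accesses."""
    if changes_detected >= accesses:
        raise ValueError("every access detected a change; the estimate is unbounded")
    return -access_rate * math.log(1.0 - changes_detected / accesses)

# 3 detected changes over 10 daily accesses (f = 1 access per day).
naive = naive_estimate(3, 10)
improved = poisson_estimate(3, 10, 1.0)
```

The improved estimate is higher than the naïve one because an access that detects a change may be hiding several changes that happened since the previous access.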
30Improvement Significant?
- Application to a Web crawler
- Visit pages once every week for 5 weeks
- Estimate change frequency
- Adjust revisit frequency based on the estimate
- Uniform: do not adjust
- Naïve: based on the naïve estimator
- Ours: based on our improved estimator
31Improvement from Our Estimator
(9,200,000 visits in total)
32Summary
- Information overload problem
- Indexing approach
- Integration approach
- Page update
- Web evolution experiment
- Change metric
- Refresh policy
- Frequency estimator
33Research Opportunity
- Efficient query processing?
- Automatic source discovery?
- Automatic data extraction?
34Web Archive Project
- Can we store the history of the Web?
- Web is ephemeral
- Study of the Evolution of the Web
- Challenges
- Update policy?
- Compression?
- New storage structure?
- New index structure?
35The End
- Thank you for your attention
- For more information visit
- http://www.cs.ucla.edu/cho/