Title: Modeling and Managing Content Changes in Text Databases
1. Modeling and Managing Content Changes in Text Databases
Panos Ipeirotis, New York University
Alexandros Ntoulas, UCLA
Junghoo Cho, UCLA
Luis Gravano, Columbia University
2. Metasearchers Provide Access to Text Databases
- Large number of hidden-web databases available
- Contents not accessible through Google
- Need to query each database separately
Broadcasting queries to all databases is not feasible (100,000 DBs)
[Figure: the query "thrombopenia" is sent to a metasearcher, which sits in front of PubMed, NYTimes Archives, and USPTO]
3. Metasearchers Provide Access to Text Databases
Database selection relies on simple content summaries: vocabulary, word frequencies
[Figure: the metasearcher must decide which databases should receive the query "thrombopenia", using each database's content summary]
PubMed: ... thrombopenia 26,887 ...
NYTimes Archives: ... thrombopenia 42 ...
USPTO: ... thrombopenia 0 ...
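A minimal sketch of summary-based database selection, assuming the simplest possible rule: rank databases by the document frequency of the query word and skip databases where the word never appears. The function name `rank_databases`, the `top_k` parameter, and the ranking rule itself are assumptions for illustration. Real selection algorithms typically normalize by database size and combine several query words.

```python
# Content summaries as word -> number of documents containing the word.
# The thrombopenia counts are the ones shown on the slide; the ranking
# rule below is an illustrative assumption, not the talk's algorithm.
summaries = {
    "PubMed": {"thrombopenia": 26887},
    "NYTimes Archives": {"thrombopenia": 42},
    "USPTO": {"thrombopenia": 0},
}


def rank_databases(query_word, summaries, top_k=2):
    """Return up to top_k database names, ordered by document frequency."""
    scored = [
        (name, summary.get(query_word, 0))
        for name, summary in summaries.items()
    ]
    # A database whose summary does not contain the word is never selected.
    scored = [(name, df) for name, df in scored if df > 0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:top_k]]


selected = rank_databases("thrombopenia", summaries)
print(selected)
```

If a summary is stale, the counts fed into this ranking are wrong, which is why the rest of the talk is about keeping summaries up to date.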
4. Extracting Content Summaries from Text Databases
- For hidden-web databases (query-only access)
  - Send queries to database
  - Retrieve top matching documents
  - Use document sample as database representative
- For crawlable databases
  - Retrieve documents by following links (crawling)
  - Stop when all documents retrieved
- Content summary contains
  - Words in sample (or crawl)
  - Document frequency of each word in sample (or crawl)

PubMed (11,868,552 documents)
Word          Documents
aids          123,826
cancer        1,598,896
heart         706,537
hepatitis     124,320
thrombopenia  26,887
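The definition of a content summary above can be sketched in a few lines. This is an illustration only: the tokenizer (lowercased alphanumeric runs) and the three sample documents are assumptions, and the sketch does not cover how the sample is obtained by querying or crawling.

```python
import re
from collections import Counter


def build_content_summary(documents):
    """Return {word: number of sampled documents that contain the word}."""
    doc_freq = Counter()
    for text in documents:
        # Use a set so each word counts once per document
        # (document frequency, not term frequency).
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        doc_freq.update(words)
    return dict(doc_freq)


# Hypothetical document sample, standing in for query results or a crawl.
sample = [
    "Heart disease and cancer screening",
    "Hepatitis and cancer risk factors",
    "Thrombopenia in cancer patients",
]
summary = build_content_summary(sample)
print(summary["cancer"], summary["heart"])
```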
5. Never-update Policy
- Current practice: construct summary once, never update
- Extracted (old) summary may
  - Miss new words (from new documents)
  - Contain obsolete words (from deleted documents)
  - Provide inaccurate frequency estimates

Word      Docs, NY Times (Oct 29, 2004)  Docs, NY Times (Mar 29, 2005)
tsunami   (0)                            250
recount   2,302                          (0)
grokster  2                              78
6. Research Challenge
- Updating summaries is costly!
- Challenge
  - Maintain good quality of summaries, and
  - Minimize number of updates
- If summaries do not change → Problem solved!
- If summaries change → Estimate rate of change and schedule updates
7. Outline
- Do content summaries change over time?
- Which database properties affect the rate of change?
- How to schedule updates with constrained resources?
8. Data for our Study: 152 Web Databases
- Randomly picked from Open Directory
- Multiple domains
- Multiple topics
- Searchable (to construct summaries by querying)
- Crawlable (to retrieve full contents)
www.wsj.com, www.intellihealth.com, www.fda.gov, www.si.edu, ...
9. Data for our Study: 152 Web Databases
- Study period: Oct 2002 to Oct 2003
- 52 weekly snapshots for each database
- 5 million pages in each snapshot (approx.)
- 65 GB per snapshot (3.3 TB total)
- For each week and each database, we built
  - Complete summary (by scanning all pages)
  - Approximate summary (by query-based sampling)
10. Measuring Changes over Time
- Recall: How many words in current summary also in old (extracted) summary?
  - Shows how well old summaries cover the current (unknown) vocabulary
  - Higher values are better
- Precision: How many words in old (extracted) summary still in current summary?
  - Shows how many obsolete words exist in the old summaries
  - Higher values are better
[Figure: results for complete summaries (similar for approximate)]
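The two measures can be written down directly from the definitions on this slide. The sketch below uses plain (unweighted) vocabulary sets. The function names and the toy vocabularies are assumptions, and any frequency-weighted variant is not shown.

```python
def summary_recall(old_words, current_words):
    """Fraction of the current vocabulary that the old summary contains."""
    current = set(current_words)
    if not current:
        return 1.0
    return len(current & set(old_words)) / len(current)


def summary_precision(old_words, current_words):
    """Fraction of the old summary's words still in the current vocabulary."""
    old = set(old_words)
    if not old:
        return 1.0
    return len(old & set(current_words)) / len(old)


# Toy vocabularies (hypothetical, for illustration only).
old = {"recount", "grokster", "election"}
current = {"tsunami", "grokster", "election", "relief"}

recall = summary_recall(old, current)        # 2 of 4 current words covered
precision = summary_precision(old, current)  # 2 of 3 old words still valid
print(recall, precision)
```

Low recall means the old summary misses new words. Low precision means it carries obsolete ones.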
11. Summaries over Time: Conclusions
- Databases (and their summaries) are not static
- Quality of old summaries deteriorates over time
- Quality decreases for both complete and approximate content summaries (see paper for details)
How often should we refresh the summaries?
12. Outline
- Do content summaries change over time?
- Which database properties affect the rate of change?
- How to schedule updates with constrained resources?
13. Survival Analysis
Survival analysis: a collection of statistical techniques for predicting the time until an event occurs
- Initially used to measure length of survival of patients under different treatments (hence the name)
- Used to measure effect of different parameters (e.g., weight, race) on survival time
- We want to predict time until next update and find database properties that affect this time
14. Survival Analysis for Summary Updates
- Survival time of summary: time until the current database summary is sufficiently different from the old one (i.e., an update is required)
- Old summary changes at time t if
  - KL divergence(current, old) > τ, where τ is the change-sensitivity threshold
- Survival analysis estimates the probability that a database summary changes within time t
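A minimal sketch of how a survival time could be read off a series of weekly summaries, assuming document-frequency dictionaries and additive smoothing so that words missing from one summary do not produce a division by zero. The smoothing constant, the helper names, and the toy snapshots are assumptions. `tau` stands for the change-sensitivity threshold.

```python
import math


def word_distribution(summary, vocabulary, smoothing=0.5):
    """Turn document frequencies into a smoothed probability distribution."""
    total = sum(summary.get(w, 0) + smoothing for w in vocabulary)
    return {w: (summary.get(w, 0) + smoothing) / total for w in vocabulary}


def kl_divergence(current, old, smoothing=0.5):
    """KL(current || old) over the union of both vocabularies."""
    vocabulary = set(current) | set(old)
    p = word_distribution(current, vocabulary, smoothing)
    q = word_distribution(old, vocabulary, smoothing)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocabulary)


def survival_time(weekly_summaries, tau):
    """Weeks until the summary first diverges from week 0 by more than tau.

    Returns (weeks, observed). observed is False when the threshold was
    never crossed during the study, i.e. the observation is censored.
    """
    old = weekly_summaries[0]
    for week, current in enumerate(weekly_summaries[1:], start=1):
        if kl_divergence(current, old) > tau:
            return week, True
    return len(weekly_summaries) - 1, False


# Hypothetical snapshots: nothing changes in week 1, a large shift in week 2.
snapshots = [
    {"a": 10, "b": 10},
    {"a": 10, "b": 10},
    {"a": 10, "b": 1, "c": 9},
]
print(survival_time(snapshots, tau=0.5))
print(survival_time(snapshots, tau=5.0))
```

With the small threshold the change is observed in week 2. With the large threshold the database never "changes" within the observation window, which is exactly the censored case discussed later.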
15. Modeling Goals
- Goal: estimate database-specific survival time distribution
- Exponential distribution S(t) = exp(-λt) common for survival times
  - λ captures rate of change
  - Need to estimate λ for each database
  - Preferably, infer λ from database properties (with no training)
- Intuitive (and wrong) approach: fit a multiple regression to the data
  - Study contains a large number of incomplete observations
  - Target variable S(t) typically not Gaussian
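To make the exponential model concrete, the sketch below evaluates the survival function and inverts it to find when a change becomes likely. The rate value is hypothetical and the function names are assumptions. Time is measured in weeks to match the weekly snapshots.

```python
import math


def survival_probability(rate, t):
    """S(t) = exp(-rate * t): probability the summary has NOT changed by t."""
    return math.exp(-rate * t)


def time_until_change_probability(rate, p_change):
    """Earliest t at which the probability of a change reaches p_change."""
    return -math.log(1.0 - p_change) / rate


rate = 0.1  # hypothetical: on average one qualifying change every 10 weeks

unchanged_after_10_weeks = survival_probability(rate, 10)  # exp(-1)
median_weeks = time_until_change_probability(rate, 0.5)    # ln(2) / rate
mean_weeks = 1.0 / rate                                    # mean of exponential
print(unchanged_after_10_weeks, median_weeks, mean_weeks)
```

A larger rate shortens both the median and the mean survival time, so a database with a higher rate needs its summary refreshed sooner.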
16. Survival Times and Incomplete Data
[Figure: survival times for a database, by week]
- Many observations are incomplete (aka censored)
- Censored data give partial information (database did not change)
17. Using Censored Data
- By ignoring censored cases we underestimate survival times → perform more update operations than needed
- By using censored cases as-is we (again) underestimate survival times
- Survival analysis extends the lifetime of censored cases
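The three options can be compared on a toy data set. The sketch below assumes the exponential model, for which the standard maximum-likelihood estimate under right-censoring is the total observed time divided by the number of observed changes. The observations are hypothetical, and this is an illustration of the censoring effect rather than the regression used in the talk.

```python
def mean_survival_estimates(observations):
    """Compare three estimates of the mean survival time.

    observations: list of (weeks, changed) pairs. changed=False marks a
    censored case: the database had not changed when observation stopped.
    """
    complete = [t for t, changed in observations if changed]
    all_times = [t for t, _ in observations]

    # 1. Drop censored cases entirely.
    ignore_censored = sum(complete) / len(complete)
    # 2. Pretend each censored case changed exactly when observation stopped.
    censored_as_is = sum(all_times) / len(all_times)
    # 3. Exponential maximum likelihood with right-censoring:
    #    total time at risk divided by the number of observed changes.
    censoring_aware = sum(all_times) / len(complete)
    return ignore_censored, censored_as_is, censoring_aware


# Hypothetical data: two observed changes, two databases still unchanged
# when the 10-week observation window ended.
observations = [(3, True), (5, True), (10, False), (10, False)]
estimates = mean_survival_estimates(observations)
print(estimates)
```

Both naive estimates come out shorter than the censoring-aware one, so a schedule based on them would refresh summaries more often than necessary.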
18. Database Properties and Survival Times
- For our analysis, we use Cox Proportional Hazards Regression
  - Makes effective use of censored data (i.e., database did not change within time T)
  - Derives effect of database properties on rate of change
  - E.g., if you double the size of a database, it changes twice as fast
  - No assumptions about the form of the survival function
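For reference, the standard form of the Cox proportional hazards model is written out below, together with a worked reading of the "double the size" example. The covariate encoding (natural log of size) and the coefficient value 1 are hypothetical, chosen only so that the arithmetic reproduces a factor of two. They are not the fitted values from the study.

```latex
% Hazard (instantaneous rate of change) for a database with properties x:
% an unspecified baseline hazard h_0(t) scaled by a function of the covariates.
h(t \mid x) = h_0(t)\, \exp\!\left(\beta_1 x_1 + \dots + \beta_k x_k\right)

% Two databases that differ only in x_1 have a hazard ratio that is
% constant over time (the "proportional hazards" property):
\frac{h(t \mid x')}{h(t \mid x)} = \exp\!\left(\beta_1 (x'_1 - x_1)\right)

% Hypothetical example: x_1 = \ln(\text{size}) and \beta_1 = 1.
% Doubling the size adds \ln 2 to x_1, so
\frac{h(t \mid 2 \cdot \text{size})}{h(t \mid \text{size})} = \exp(\ln 2) = 2
```

Because the baseline hazard cancels in the ratio, the effect of each property can be estimated without assuming a particular shape for the survival function.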
19. Cox PH Regression Results
- Examined effect of
  - Change-sensitivity threshold τ (higher τ → longer survival)
  - Topic (does not matter, except for health-related sites)
  - Size (larger databases change faster!)
  - Number of words (does not matter)
  - Differences of summaries extracted in consecutive weeks (sites that changed frequently in the past, change frequently in the future)
  - Domain (details in next slide)
20. Baseline Survival Functions by Domain
- Effect of domain
  - GOV changes more slowly than any other domain
  - EDU changes fast in the short term, but more slowly in the long term
  - COM and other commercial sites change faster than the rest
21. Results of Cox PH Analysis
- Cox PH analysis gives a formula for predicting the time between updates for any database
- Rate of change depends on
  - domain
  - database size
  - history of change
  - threshold τ
By knowing the time between updates we can schedule update operations better!
22. Outline
- Do content summaries change over time?
- Which database properties affect the rate of change?
- How to schedule updates with constrained resources?
23. Deriving an Update Policy
- Naïve policy
  - Updates all databases at the same time (i.e., assumes identical change rates)
  - Suboptimal use of resources
- Our policy
  - Use change rate as predicted by survival analysis
  - Exploit database-specific estimates for rate of change
24. Scheduling Updates
With plentiful resources, we update sites according to their rate of change
When resources are constrained, we update sites that change too frequently less often
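The behaviour described on this slide can be reproduced with a small allocation sketch. It is not the policy from the talk: it borrows the expected-freshness measure from the web-crawling work of Cho and Garcia-Molina (listed under Related Work), assumes changes arrive as a Poisson process, and hands out updates greedily. The change rates, budgets, and function names are all hypothetical.

```python
import heapq
import math


def expected_freshness(change_rate, updates):
    """Time-averaged probability that a summary is up to date when it is
    refreshed `updates` times per period at evenly spaced intervals."""
    if updates == 0:
        return 0.0
    x = change_rate / updates
    return (1.0 - math.exp(-x)) / x


def allocate_updates(change_rates, budget):
    """Give out `budget` updates one at a time, each to the database whose
    expected freshness improves the most from one additional update."""
    counts = {name: 0 for name in change_rates}
    # Max-heap on marginal gain (negated, because heapq is a min-heap).
    heap = [(-expected_freshness(rate, 1), name)
            for name, rate in change_rates.items()]
    heapq.heapify(heap)
    for _ in range(budget):
        _, name = heapq.heappop(heap)
        counts[name] += 1
        rate, k = change_rates[name], counts[name]
        gain = expected_freshness(rate, k + 1) - expected_freshness(rate, k)
        heapq.heappush(heap, (-gain, name))
    return counts


# Hypothetical change rates (expected changes per period).
change_rates = {"slow": 0.5, "medium": 2.0, "fast": 50.0}

tight = allocate_updates(change_rates, budget=6)
generous = allocate_updates(change_rates, budget=200)
print(tight)
print(generous)
```

Under the tight budget the fastest-changing database receives the fewest updates, because a handful of refreshes cannot keep it fresh anyway. Under the generous budget it receives the largest share.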
25. Scheduling Results
- Clever scheduling improves quality of summaries (according to KL, precision and recall)
- Our policy allows users to select change thresholds optimally according to available resources, or vice versa (see paper)
26. Updating Content Summaries: Contributions
- Extensive experimental study (1 year, 152 databases): established the need to periodically update statistics (summaries) for text databases
- Change frequency model: showed that database characteristics can predict time between updates
- Scheduling algorithms: devised update policies that exploit the survival model and use available resources efficiently
27. Current and Future Work
- Current
  - Compared with machine learning techniques
  - Applied technique to web crawling
- Future
  - Apply survival analysis for refreshing db statistics (materialized views, index statistics, ...)
  - Examine efficiency of survival analysis models
  - Create generative models for modeling database changes
28. Thank you!
29. Related Work
- Brewington & Cybenko, WWW9, Computer 2000
- Cho & Garcia-Molina, VLDB 2000, SIGMOD 2000, TOIT 2003
- Coffman et al., J. Scheduling, 1998
- Olston & Widom, SIGMOD 2002
30. Measuring Changes over Time
- KL divergence: How similar is the word distribution in old and current summaries?
  - Identical summaries: KL = 0
  - Higher values are worse
[Figure: results for complete summaries (similar for approximate)]