Modeling and Managing Content Changes in Text Databases

1
Modeling and Managing Content Changes in Text
Databases
  • Panos Ipeirotis
  • New York University

Alexandros Ntoulas, UCLA
Junghoo Cho, UCLA
Luis Gravano, Columbia University
2
Metasearchers Provide Access to Text Databases
  • Large number of hidden-web databases available
  • Contents not accessible through Google
  • Need to query each database separately

Broadcasting queries to all databases is not
feasible (100,000 DBs)
[Figure: the query "thrombopenia" is sent to a Metasearcher, which fronts PubMed, NYTimes Archives, and USPTO]
3
Metasearchers Provide Access to Text Databases
Database selection relies on simple content
summaries: vocabulary, word frequencies
[Figure: the query "thrombopenia" reaches the Metasearcher, which must choose among PubMed, NYTimes Archives, and USPTO]
Content summary entries for "thrombopenia":
PubMed            26,887
NYTimes Archives  42
USPTO             0
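
To make the selection step concrete, here is a minimal Python sketch. The scoring rule (sum of document frequencies of the query words) and the function name `rank_databases` are illustrative assumptions, not the metasearcher's actual ranking function; the pairing of counts to databases follows the order shown on the slide.

```python
def rank_databases(query_words, summaries):
    """Rank databases by a simple, assumed score: the sum of the
    document frequencies of the query words in each content summary.
    Databases with a score of 0 are dropped."""
    scores = {
        name: sum(summary.get(word, 0) for word in query_words)
        for name, summary in summaries.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, score) for name, score in ranked if score > 0]


# Content summaries restricted to the one word shown on the slide.
summaries = {
    "PubMed": {"thrombopenia": 26887},
    "NYTimes Archives": {"thrombopenia": 42},
    "USPTO": {"thrombopenia": 0},
}
ranking = rank_databases(["thrombopenia"], summaries)
```

With these summaries the metasearcher would query PubMed first and skip USPTO entirely, which is why stale frequencies directly hurt selection.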
4
Extracting Content Summaries from Text Databases
  • For hidden-web databases (query-only access)
  • Send queries to database
  • Retrieve top matching documents
  • Use document sample as database representative
  • For crawlable databases
  • Retrieve documents by following links (crawling)
  • Stop when all documents retrieved

PubMed (11,868,552 documents)
Word          Documents
aids          123,826
cancer        1,598,896
heart         706,537
hepatitis     124,320
thrombopenia  26,887
  • Content summary contains
  • Words in sample (or crawl)
  • Document frequency of each word in sample (or
    crawl)
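
A content summary of this kind can be sketched in a few lines of Python. The whitespace tokenizer and the name `build_content_summary` are simplifying assumptions for illustration; the input is a document sample (or a full crawl) given as a list of strings.

```python
from collections import Counter


def build_content_summary(documents):
    """Return {word: document frequency} for a list of document strings."""
    doc_freq = Counter()
    for text in documents:
        # Use a set so each word counts once per document
        # (document frequency, not term frequency).
        doc_freq.update(set(text.lower().split()))
    return dict(doc_freq)


# Made-up three-document sample.
sample = [
    "hepatitis and cancer study",
    "heart disease and cancer",
    "cancer cancer screening",
]
summary = build_content_summary(sample)
```

Note that "cancer" appears four times in the sample but in only three documents, so its document frequency is 3.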

5
Never-update Policy
  • Current practice: construct summary once, never
    update
  • Extracted (old) summary may:
  • Miss new words (from new documents)
  • Contain obsolete words (from deleted documents)
  • Provide inaccurate frequency estimates

Word      NY Times (Oct 29, 2004) Docs   NY Times (Mar 29, 2005) Docs
tsunami   (0)                            250
recount   2,302                          (0)
grokster  2                              78

6
Research Challenge
  • Updating summaries is costly!
  • Challenge
  • Maintain good quality of summaries, and
  • Minimize number of updates
  • If summaries do not change → Problem solved!
  • If summaries change → Estimate rate of change
    and schedule updates

7
Outline
  • Do content summaries change over time?
  • Which database properties affect the rate of
    change?
  • How to schedule updates with constrained
    resources?

8
Data for our Study: 152 Web Databases
  • Randomly picked from Open Directory
  • Multiple domains
  • Multiple topics
  • Searchable (to construct summaries by querying)
  • Crawlable (to retrieve full contents)

www.wsj.com, www.intellihealth.com, www.fda.gov,
www.si.edu,
9
Data for our Study: 152 Web Databases
  • Study period: Oct 2002 to Oct 2003
  • 52 weekly snapshots for each database
  • 5 million pages in each snapshot (approx.)
  • 65 GB per snapshot (3.3 TB total)
  • For each week and each database, we built:
  • Complete summary (by scanning all pages)
  • Approximate summary (by query-based sampling)

10
Measuring Changes over Time
  • Recall: How many words in the current summary are
    also in the old (extracted) summary?
  • Shows how well old summaries cover the current
    (unknown) vocabulary
  • Higher values are better
  • Precision: How many words in the old (extracted)
    summary are still in the current summary?
  • Shows how many obsolete words exist in the old
    summaries
  • Higher values are better

Results for complete summaries (similar for
approximate)
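
The two measures can be sketched in Python as set overlaps over the vocabularies. This is the unweighted reading of the definitions above; the function names and the toy vocabularies are made up for illustration.

```python
def summary_recall(old_words, current_words):
    """Fraction of the current vocabulary that the old summary covers."""
    old, current = set(old_words), set(current_words)
    return len(current & old) / len(current)


def summary_precision(old_words, current_words):
    """Fraction of the old vocabulary that is still in the current summary."""
    old, current = set(old_words), set(current_words)
    return len(old & current) / len(old)


# Toy vocabularies: "recount" became obsolete, "tsunami" and "court" are new.
old_vocab = {"recount", "grokster", "election"}
current_vocab = {"tsunami", "grokster", "election", "court"}

recall = summary_recall(old_vocab, current_vocab)        # 2 of 4 current words
precision = summary_precision(old_vocab, current_vocab)  # 2 of 3 old words
```

Both functions assume non-empty vocabularies; an empty summary would need an explicit convention.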
11
Summaries over Time: Conclusions
  • Databases (and their summaries) are not static
  • Quality of old summaries deteriorates over time
  • Quality decreases for both complete and
    approximate content summaries (see paper for
    details)

How often should we refresh the summaries?
12
Outline
  • Do content summaries change over time?
  • Which database properties affect the rate of
    change?
  • How to schedule updates with constrained
    resources?

13
Survival Analysis
Survival Analysis: A collection of statistical
techniques for predicting the time until an
event occurs
  • Initially used to measure length of survival of
    patients under different treatments (hence the
    name)
  • Used to measure effect of different parameters
    (e.g., weight, race) on survival time
  • We want to predict time until next update and
    find database properties that affect this time

14
Survival Analysis for Summary Updates
  • Survival time of summary: Time until current
    database summary is sufficiently different from
    the old one (i.e., an update is required)
  • Old summary changes at time t if
  • KL divergence(current, old) > τ
    (τ: change sensitivity threshold)
  • Survival analysis estimates probability that a
    database summary changes within time t
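
The definition above can be sketched in Python as follows. Two details are assumptions made for this sketch and are not stated on the slides: add-one smoothing over the union vocabulary (so that words missing from one summary do not produce a division by zero), and the direction KL(current ‖ old). The function names are hypothetical.

```python
import math


def word_distribution(summary, vocabulary, smoothing=1.0):
    """Turn {word: frequency} into a smoothed probability distribution."""
    total = sum(summary.get(w, 0) + smoothing for w in vocabulary)
    return {w: (summary.get(w, 0) + smoothing) / total for w in vocabulary}


def kl_divergence(current, old, smoothing=1.0):
    """KL(current || old) over the union vocabulary (assumed smoothing)."""
    vocabulary = set(current) | set(old)
    p = word_distribution(current, vocabulary, smoothing)
    q = word_distribution(old, vocabulary, smoothing)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocabulary)


def survival_time(weekly_summaries, tau):
    """Return (weeks, observed). `observed` is False when the divergence
    never exceeded tau, i.e. the observation is censored."""
    old = weekly_summaries[0]
    for week, current in enumerate(weekly_summaries[1:], start=1):
        if kl_divergence(current, old) > tau:
            return week, True
    return len(weekly_summaries) - 1, False


# Made-up weekly snapshots of one database's summary.
snapshots = [
    {"recount": 10, "grokster": 1},
    {"recount": 10, "grokster": 1},
    {"tsunami": 10, "grokster": 5},
]
```

A larger τ tolerates more drift before declaring a change, so the same snapshots can yield a censored observation instead of an observed one.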
15
Modeling Goals
  • Goal: Estimate database-specific survival time
    distribution
  • Exponential distribution S(t) = exp(-λt) common
    for survival times
  • λ captures rate of change
  • Need to estimate λ for each database
  • Preferably, infer λ from database properties
    (with no training)
  • Intuitive (and wrong) approach: multiple
    regression on the data
  • Study contains a large number of incomplete
    observations
  • Target variable S(t) typically not Gaussian
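
As a small illustration of the exponential model, the Python sketch below evaluates S(t) = exp(-λt) and inverts it to find when the probability of a change reaches a chosen level. The function names and the example rate are assumptions for illustration only.

```python
import math


def survival_probability(rate, t):
    """S(t) = exp(-rate * t): probability the summary has NOT changed by time t."""
    return math.exp(-rate * t)


def time_until_change_probability(rate, p):
    """Time t at which the summary has changed with probability p,
    obtained by solving 1 - exp(-rate * t) = p for t."""
    return -math.log(1.0 - p) / rate


rate = 0.1  # hypothetical rate of change, per week
median_weeks = time_until_change_probability(rate, 0.5)  # equals ln(2) / rate
```

For a hypothetical rate of 0.1 per week, the summary is as likely as not to have changed after roughly 7 weeks.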

16
Survival Times and Incomplete Data
[Figure: survival times for a database, by week]
  • Many observations are incomplete (aka
    censored)
  • Censored data give partial information (database
    did not change)

17
Using Censored Data
  • By ignoring censored cases we get
    (under)estimates of survival time → perform more
    update operations than needed
  • By using censored cases as-is we get (again)
    underestimates
  • Survival analysis extends the lifetime of
    censored cases
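
The effect of censoring can be seen with the standard maximum-likelihood estimate for an exponential model, where the rate is the number of observed changes divided by the total observed time. The sketch below is a textbook illustration with made-up observations, not the Cox regression used in the study; the function names are hypothetical.

```python
def rate_ignoring_censored(observations):
    """Drop censored cases entirely (biased towards fast change)."""
    times = [weeks for weeks, changed in observations if changed]
    return len(times) / sum(times)


def rate_censored_as_events(observations):
    """Pretend every censored case changed when observation stopped."""
    return len(observations) / sum(weeks for weeks, _ in observations)


def rate_censoring_aware(observations):
    """Exponential MLE with right censoring:
    observed changes / total time under observation."""
    events = sum(1 for _, changed in observations if changed)
    return events / sum(weeks for weeks, _ in observations)


# (weeks observed, did the summary change?) for four made-up databases;
# two of them never changed during the 52-week study (censored).
observations = [(4, True), (10, True), (52, False), (52, False)]

ignored = rate_ignoring_censored(observations)     # 2 / 14
as_events = rate_censored_as_events(observations)  # 4 / 118
aware = rate_censoring_aware(observations)         # 2 / 118
```

Both shortcuts give a larger rate than the censoring-aware estimate, which means shorter predicted survival times and more update operations than needed.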

18
Database Properties and Survival Times
  • For our analysis, we use Cox Proportional Hazards
    Regression
  • Uses censored data effectively (i.e., database
    did not change within time T)
  • Derives effect of database properties on rate of
    change
  • E.g., if you double the size of a database, it
    changes twice as fast
  • No assumptions about the form of the survival
    function
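
The proportional hazards idea can be illustrated with a short Python sketch: covariates scale the hazard by exp(β·x), so the survival function becomes the baseline survival raised to that power. The covariate name `log2_size` and its coefficient are hypothetical values chosen to reproduce the "double the size, change twice as fast" example; they are not fitted results from the study.

```python
import math


def hazard_ratio(coefficients, covariates):
    """exp(sum of beta_i * x_i): how much faster than baseline the hazard is."""
    return math.exp(sum(coefficients[name] * value
                        for name, value in covariates.items()))


def survival_given_covariates(baseline_survival, coefficients, covariates):
    """S(t | x) = S0(t) ** hazard_ratio. The baseline S0(t) is supplied as a
    number; the model itself does not assume a particular shape for it."""
    return baseline_survival ** hazard_ratio(coefficients, covariates)


# Hypothetical coefficient: each doubling of size doubles the hazard.
coefficients = {"log2_size": math.log(2)}
reference = {"log2_size": 0.0}  # database of reference size
doubled = {"log2_size": 1.0}    # database twice as large

# If a reference database survives t weeks with probability 0.8,
# the twice-as-large one survives with probability 0.8 ** 2.
s_reference = survival_given_covariates(0.8, coefficients, reference)
s_doubled = survival_given_covariates(0.8, coefficients, doubled)
```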

19
Cox PH Regression Results
  • Examined effect of:
  • Change-sensitivity threshold τ (higher τ → longer
    survival)
  • Topic (does not matter, except for health-related
    sites)
  • Size (larger databases change faster!)
  • Number of words (does not matter)
  • Differences of summaries extracted in consecutive
    weeks (sites that changed frequently in the past,
    change frequently in the future)
  • Domain (details in next slide)

[Legend: rate of change increases / rate of change decreases]
20
Baseline Survival Functions by Domain
  • Effect of domain
  • GOV changes slower than any other domain
  • EDU changes fast in the short term, but slower in
    the long term
  • COM and other commercial sites change faster than
    the rest

21
Results of Cox PH Analysis
  • Cox PH analysis gives a formula for predicting
    the time between updates for any database
  • Rate of change depends on
  • domain
  • database size
  • history of change
  • threshold τ

By knowing the time between updates we can schedule
update operations better!
22
Outline
  • Do content summaries change over time?
  • Which database properties affect the rate of
    change?
  • How to schedule updates with constrained
    resources?

23
Deriving an Update Policy
  • Naïve policy
  • Updates all databases at the same time (i.e.,
    assumes identical change rates)
  • Suboptimal use of resources
  • Our policy
  • Use change rate as predicted by survival analysis
  • Exploit database-specific estimates for rate of
    change

24
Scheduling Updates
With plentiful resources, we update sites
according to their rate of change
When resources are constrained, we update sites
that change too frequently less often
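
The behavior under constrained resources can be reproduced with a small Python sketch. It is not the scheduling algorithm from the paper: it assumes exponential change with known rates, measures quality as the time-averaged probability that a summary is still unchanged, (1 - exp(-λT)) / (λT) for update interval T, and hands out a fixed budget of updates greedily. All names and rates are hypothetical.

```python
import math


def average_freshness(rate, interval):
    """Time-averaged probability that a summary refreshed every `interval`
    weeks is still unchanged, under an exponential change model."""
    x = rate * interval
    return -math.expm1(-x) / x  # (1 - exp(-x)) / x


def allocate_updates(rates, budget, horizon):
    """Greedily give each update to the database with the largest gain in
    average freshness. n updates split the horizon into n + 1 intervals."""
    allocation = {name: 0 for name in rates}

    def freshness(name, n_updates):
        return average_freshness(rates[name], horizon / (n_updates + 1))

    for _ in range(budget):
        best = max(
            rates,
            key=lambda name: freshness(name, allocation[name] + 1)
            - freshness(name, allocation[name]),
        )
        allocation[best] += 1
    return allocation


# Hypothetical change rates per week for three databases.
rates = {"slow": 0.05, "medium": 0.5, "fast": 5.0}
allocation = allocate_updates(rates, budget=10, horizon=52)
```

With only 10 updates over 52 weeks, the fastest-changing database receives none: its summary goes stale again almost immediately, so the same update buys more quality elsewhere.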
25
Scheduling Results
  • Clever scheduling improves quality of summaries
    (according to KL, precision and recall)
  • Our policy allows users to optimally select
    change thresholds according to available
    resources, or vice versa (see paper)

26
Updating Content Summaries: Contributions
  • Extensive experimental study (1 year, 152
    dbases): established the need to periodically
    update statistics (summaries) for text
    databases
  • Change frequency model: showed that database
    characteristics can predict time between updates
  • Scheduling algorithms: devised update policies
    that exploit the survival model and use available
    resources efficiently

27
Current and Future Work
  • Current
  • Compared with machine learning techniques
  • Applied technique for web crawling
  • Future
  • Apply survival analysis for refreshing db
    statistics
  • (materialized views, index statistics, ...)
  • Examine efficiency of survival analysis models
  • Create generative models for modeling database
    changes

28
Thank you!
  • Questions?

29
Related Work
  • Brewington & Cybenko, WWW9, Computer 2000
  • Cho & Garcia-Molina, VLDB 2000, SIGMOD 2000, TOIT 2003
  • Coffman, J. Scheduling, 1998
  • Olston & Widom, SIGMOD 2002

30
Measuring Changes over Time
  • KL divergence: How similar is the word
    distribution in old and current summaries?
  • Identical summaries: KL = 0
  • Higher values are worse

Results for complete summaries (similar for
approximate)