Classifying and Searching "Hidden-Web" Text Databases - PowerPoint PPT Presentation

1 / 69
About This Presentation
Title:

Classifying and Searching "Hidden-Web" Text Databases

Description:

Classifying and Searching – PowerPoint PPT presentation

Number of Views:145
Avg rating:3.0/5.0
Slides: 70
Provided by: Panagioti4
Category:

less

Transcript and Presenter's Notes

Title: Classifying and Searching "Hidden-Web" Text Databases


1
Classifying and Searching "Hidden-Web" Text
Databases
  • Panos Ipeirotis

Department of Information Systems New York
University
2
Motivation: Surface Web vs. Hidden Web
  • Surface Web
  • Link structure
  • Crawlable
  • Documents indexed by search engines
  • Hidden Web
  • No link structure
  • Documents hidden in databases
  • Documents not indexed by search engines
  • Need to query each collection individually

3
Hidden-Web Databases: Examples
Search on the U.S. Patent and Trademark Office
(USPTO) database for "wireless network" → 39,270
matches (the USPTO database is at
http://patft.uspto.gov/netahtml/search-bool.html).
Search on Google restricted to the USPTO database site for
"wireless network site:patft.uspto.gov" → 1 match.

Database              Query               Database Matches   Site-Restricted Google Matches
USPTO                 wireless network    39,270             1
Library of Congress   visa regulations    >10,000            0
PubMed                thrombopenia        29,022             2

(as of Oct 3rd, 2005)
4
Interacting With Hidden-Web Databases
  • Browsing: Yahoo!-like directories
  • InvisibleWeb.com
  • SearchEngineGuide.com
  • Searching: Metasearchers

Populated Manually
5
Outline of Talk
  • Classification of Hidden-Web Databases
  • Search over Hidden-Web Databases
  • Managing Changes in Hidden-Web Databases

6
Hierarchically Classifying the ACM Digital Library
[Diagram: the ACM Digital Library is assigned to nodes of a topic hierarchy]
7
Text Database Classification: Definition
  • For a text database D and a category C:
  • Coverage(D,C): number of docs in D about C
  • Specificity(D,C): fraction of docs in D about C
  • Assign a text database to a category C if:
  • Database coverage for C is at least Tc
  • Tc: coverage threshold (e.g., > 100 docs in C)
  • Database specificity for C is at least Ts
  • Ts: specificity threshold (e.g., > 40% of docs
    in C)

8
Brute-Force Classification Strategy
  • Extract all documents from database
  • Classify documents on topic
  • (use state-of-the-art classifiers: SVMs, C4.5,
    RIPPER, ...)
  • Classify database according to topic distribution

Problem: No direct access to full contents of
Hidden-Web databases
9
Classification: Goal & Challenges
  • Goal
  • Discover database topic distribution
  • Challenges
  • No direct access to full contents of Hidden-Web
    databases
  • Only limited search interfaces available
  • Should not overload databases

Key observation: Only queries about the database
topic(s) generate a large number of matches
10
Query-based Database Classification: Overview
  1. Train document classifier
  2. Extract queries from classifier
  3. Adaptively issue queries to database
  4. Identify topic distribution based on adjusted
    number of query matches
  5. Classify database

[Pipeline diagram: TRAIN CLASSIFIER → EXTRACT QUERIES
(Sports: "nba knicks"; Health: "sars") → QUERY DATABASE
("sars" → 1,254 matches) → IDENTIFY TOPIC DISTRIBUTION →
CLASSIFY DATABASE]
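Below is a minimal Python sketch of this query-based classification loop; the get_match_count stub, the canned match counts, the two-category confusion matrix, and the thresholds are illustrative assumptions, not the actual implementation.

```python
import numpy as np

# Hypothetical stand-in for a database's search interface: it returns only the
# number of matches for a query, without retrieving any documents.
def get_match_count(query):
    canned = {"nba knicks": 780, "sars": 1254}   # illustrative values only
    return canned.get(query, 0)

# Query probes extracted from the document classifier, one list per category.
category_queries = {
    "Sports": ["nba knicks"],
    "Health": ["sars"],
}

# Steps 2-3: issue the probes and accumulate the raw (estimated) coverage.
categories = list(category_queries)
estimated = np.array([sum(get_match_count(q) for q in category_queries[c])
                      for c in categories], dtype=float)

# Step 4: adjust with the classifier's confusion matrix M (learned offline):
# Coverage(D) ≈ M^-1 · ECoverage(D).
M = np.array([[0.85, 0.04],    # illustrative 2x2 confusion matrix
              [0.15, 0.96]])
coverage = np.linalg.inv(M) @ estimated
specificity = coverage / coverage.sum()

# Step 5: classify, keeping categories above the coverage/specificity thresholds.
Tc, Ts = 100, 0.4
print([c for c, cov, spec in zip(categories, coverage, specificity)
       if cov >= Tc and spec >= Ts])
```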
11
Training a Document Classifier
  • Get training set (set of pre-classified
    documents)
  • Select best features to characterize documents
  • (Zipf's law + information-theoretic feature
    selection)
    [Koller and Sahami 1996]
  • Train classifier (SVM, C4.5, RIPPER, ...)

Output: a black-box model for classifying
documents (Document → Classifier)
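A hedged sketch of this training step using scikit-learn (assumed available): a bag-of-words representation, a chi-squared feature-selection step standing in for the information-theoretic selection cited above, and a linear SVM as the black-box classifier; the four toy documents are placeholders for the real training set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy pre-classified training set (stand-in for the 54,000 Usenet articles).
docs = ["nba knicks game tonight", "sars outbreak hospital",
        "knicks playoffs score", "flu vaccine sars clinic"]
labels = ["Sports", "Health", "Sports", "Health"]

# Select the most informative features, then train the document classifier.
classifier = make_pipeline(
    CountVectorizer(),            # bag-of-words representation
    SelectKBest(chi2, k=6),       # information-theoretic-style feature selection
    LinearSVC(),                  # black-box document classifier
)
classifier.fit(docs, labels)
print(classifier.predict(["knicks beat the nets"]))   # expected: ['Sports'] on this toy data
```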
13
Extracting Query Probes
ACM TOIS 2003
  • Transform classifier model into queries
  • Trivial for rule-based classifiers (RIPPER)

Example query for Sports: "nba knicks"
14
Querying Database with Extracted Queries
  • Issue each query to database to obtain number of
    matches without retrieving any documents
  • Increase coverage of the rule's category accordingly
    (Sports: Sports + 706)

SIGMOD 2001
ACM TOIS 2003
15
Identifying Topic Distribution from Query Results
Query-based estimates of topic distribution not
perfect
  • Document classifiers not perfect
  • Rules for one category match documents from other
    categories
  • Querying not perfect
  • Queries for same category might overlap
  • Queries do not match all documents in a category

Solution: Learn to adjust the results of query probes
16
Confusion Matrix Adjustment of Query Probe
Results
Correct (but unknown) topic distribution vs. the incorrect
topic distribution derived from query probing:

              Real Coverage    Estimated Coverage
  comp            1,000             1,300
  sports          5,000             4,332
  health             50               818

Confusion matrix M (columns: correct class, rows: assigned class):

              comp    sports   health
  comp        0.80     0.10     0.00
  sports      0.08     0.85     0.04
  health      0.02     0.15     0.96

M × Real Coverage = Estimated Coverage:
  0.80·1000 + 0.10·5000 + 0.00·50 = 1,300
  0.08·1000 + 0.85·5000 + 0.04·50 = 4,332
  0.02·1000 + 0.15·5000 + 0.96·50 =   818

This multiplication can be inverted to get a
better estimate of the real topic distribution
from the probe results.

(e.g., 10% of sports documents match queries for
computers: M[comp, sports] = 0.10)
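A small numpy check of this adjustment, using the numbers on this slide: multiplying the confusion matrix by the real coverage reproduces the probe estimates, and multiplying the estimates by the inverse recovers the real distribution.

```python
import numpy as np

# Confusion matrix M: rows = assigned class, columns = correct class
# (order: comp, sports, health).
M = np.array([[0.80, 0.10, 0.00],
              [0.08, 0.85, 0.04],
              [0.02, 0.15, 0.96]])

real = np.array([1000, 5000, 50])        # correct (but unknown) coverage
estimated = M @ real                      # coverage estimated from query probes
print(estimated)                          # [1300. 4332.  818.]

# Coverage(D) ≈ M^-1 · ECoverage(D): invert to recover the real distribution.
adjusted = np.linalg.inv(M) @ estimated
print(np.round(adjusted))                 # [1000. 5000.   50.]
```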
17
Confusion Matrix Adjustment of Query Probe
Results
Coverage(D) ≈ M⁻¹ · ECoverage(D)
(adjusted estimate of topic distribution = M⁻¹ × probing results)
  • M usually diagonally dominant for reasonable
    document classifiers, hence invertible
  • Compensates for errors in query-based estimates
    of topic distribution

18
Classification Algorithm (Again)
  1. Train document classifier
  2. Extract queries from classifier
  3. Adaptively issue queries to database
  4. Identify topic distribution based on adjusted
    number of query matches
  5. Classify database

Steps 1-2 are a one-time process; steps 3-5 are repeated
for every database.
19
Experimental Setup
  • 72-node 4-level topic hierarchy from
    InvisibleWeb/Yahoo! (54 leaf nodes)
  • 500,000 Usenet articles
  • Newsgroups assigned by hand to hierarchy nodes
  • RIPPER trained with 54,000 articles (1,000
    articles per leaf), 27,000 articles to construct
    confusion matrix
  • 500 Controlled databases built using 419,000
    newsgroup articles
  • (to run detailed experiments)
  • 130 real Web databases picked from InvisibleWeb
    (first 5 under each topic)

Example newsgroups: comp.hardware, rec.music.classical, rec.photo.
20
Experimental Results: Controlled Databases
  • Accuracy (using F-measure)
  • Above 80% for most <Tc, Ts> threshold
    combinations tried
  • Degrades gracefully with hierarchy depth
  • Confusion-matrix adjustment helps
  • Efficiency
  • Relatively small number of queries (<500) needed
    for most <Tc, Ts> threshold combinations tried

21
Experimental Results: Web Databases
  • Accuracy (using F-measure)
  • 70% for best <Tc, Ts> combination
  • Learned thresholds that reproduce human
    classification
  • Tested threshold choice using 3-fold cross
    validation
  • Efficiency
  • 120 queries per database on average needed for
    choice of thresholds, no documents retrieved
  • Only small part of hierarchy explored
  • Queries are short: 1.5 words on average, 4 words
    maximum (easily handled by most Web databases)

22
Hidden-Web Database Classification: Summary
  • Handles autonomous Hidden-Web databases
    accurately and efficiently
  • 70% F-measure
  • Only 120 queries issued on average, with no
    documents retrieved
  • Handles large family of document classifiers
    (and can hence exploit future advances in
    machine learning)

23
Outline of Talk
  • Classification of Hidden-Web Databases
  • Search over Hidden-Web Databases
  • Managing Changes in Hidden-Web Databases

24
Interacting With Hidden-Web Databases
  • Browsing Yahoo!-like directories
  • Searching Metasearchers


[Diagram: a Metasearcher forwards a user Query to Hidden-Web
databases such as NYTimes Archives, PubMed, USPTO, and the
Library of Congress; this content is not accessible through Google]

25
Metasearchers Provide Access to Distributed
Databases
Database selection relies on simple content
summaries: vocabulary, word frequencies

Example content summary for PubMed (11,868,552 documents):
  aids            121,491
  cancer        1,562,477
  heart           691,360
  hepatitis       121,129
  thrombopenia     24,826

[Diagram: a Metasearcher routes the query "thrombopenia" to
PubMed, NYTimes Archives, or USPTO based on their summaries]
Databases typically do not export such summaries!
26
Extracting Representative Document Sample
Focused Sampling
  • Train a document classifier
  • Create queries from classifier
  • Adaptively issue queries to databases
  • Retrieve top-k matching documents for each query
  • Save matches for each one-word query
  • Identify topic distribution based on adjusted
    number of query matches
  • Categorize the database
  • Generate content summary from document sample

Focused sampling retrieves documents only from
topically dense areas of the database
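A minimal sketch of turning such a document sample into a content summary (word-to-document-frequency counts); the three sample documents are placeholders, not actual probe results.

```python
from collections import Counter

# Documents retrieved by the one-word query probes (placeholder sample).
sample_docs = [
    "thrombopenia observed after treatment",
    "cancer treatment and metastasis",
    "aids vaccine trial results",
]

# Content summary: for each word, the number of sampled documents containing it.
summary = Counter()
for doc in sample_docs:
    summary.update(set(doc.lower().split()))

print(summary.most_common(5))
```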
27
Sampling and Incomplete Content Summaries
Problem: Summaries from small samples are highly
incomplete

[Plot: log(frequency) vs. rank for the words in the PubMed
database; 95% of the words appear in less than 0.1% of the
documents]
  • Many words appear in relatively few documents
    (Zipf's law)

28
Sampling and Incomplete Content Summaries
Problem: Summaries from small samples are highly
incomplete

[Plot: log(frequency) vs. rank for the words in the PubMed
database (95% of the words appear in less than 0.1% of the
documents); e.g., "endocarditis" appears in ~10,000 docs,
i.e., 0.1% of the database]
  • Many words appear in relatively few documents
    (Zipf's law)
  • Low-frequency words are often important

29
Sampling and Incomplete Content Summaries
Problem: Summaries from small samples are highly
incomplete

[Plot: log(frequency) vs. rank for the words in the PubMed
database, with a sample of 300 documents; low-frequency words
such as "endocarditis" (~0.1% of the database) are likely to
be missed by the sample]
  • Many words appear in relatively few documents
    (Zipf's law)
  • Low-frequency words are often important
  • Small document samples miss many low-frequency
    words

30
Sample-based Content Summaries
Challenge: Improve content summary quality
without increasing sample size
  • Main Idea: Database classification helps
  • Similar topics → similar content summaries

31
Databases with Similar Topics
  • CANCERLIT contains "metastasis", not found during
    sampling
  • CancerBACUP contains "metastasis"
  • Databases under same category have similar
    vocabularies, and can complement each other

32
Content Summaries for Categories
  • Databases under same category share similar
    vocabulary
  • Higher level category content summaries provide
    additional useful estimates
  • All estimates in category path are potentially
    useful

33
Enhancing Summaries Using Shrinkage
  • Estimates from database content summaries can be
    unreliable
  • Category content summaries are more reliable
    (based on larger samples) but less specific to
    database
  • By combining estimates from category and database
    content summaries we get better estimates

SIGMOD 2004
34
Shrinkage-based Estimations
Adjust the estimate for "metastasis" in D:
  λ1·0.002 + λ2·0.05 + λ3·0.092 + λ4·0.000
Select the λi weights to maximize the probability
that the summary of D is from a database under
all its parent categories

Avoids the sparse-data problem and decreases
estimation risk
35
Computing Shrinkage-based Summaries
[Hierarchy: Root → Health → Cancer → D]

Pr[metastasis | D] = λ1·0.002 + λ2·0.05 + λ3·0.092 + λ4·0.000
Pr[treatment | D]  = λ1·0.015 + λ2·0.12 + λ3·0.179 + λ4·0.184
  • Automatic computation of the λi weights using an EM
    algorithm
  • Computation performed offline → no query overhead

Avoids sparse data problem and decreases
estimation risk
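A quick numeric sketch of this shrinkage combination in Python, using the values from these slides (the λ weights are the CANCERLIT weights reported on the next slide):

```python
# Shrinkage weights for the path Root -> Health -> Cancer -> database (CANCERLIT).
weights = {"root": 0.02, "health": 0.13, "cancer": 0.20, "cancerlit": 0.65}

# Word-probability estimates (as fractions) from each content summary on the path.
p_metastasis = {"root": 0.002, "health": 0.05, "cancer": 0.092, "cancerlit": 0.000}

# Shrinkage-based estimate: weighted combination of the estimates on the path.
estimate = sum(weights[c] * p_metastasis[c] for c in weights)
print(round(estimate, 4))   # 0.0249, i.e., ~2.5%
```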
36
Shrinkage Weights and Summary
Shrinkage weights for CANCERLIT: λroot = 0.02, λhealth = 0.13,
λcancer = 0.20, λcancerlit = 0.65

Word         Shrinkage-based   Root    Health   Cancer   CANCERLIT
             (new estimate)                              (old estimate)
metastasis        2.5%          0.2%     5%       9.2%        0%
aids             14.3%          0.8%     7%       2%         20%
football          0.17%         2%       1%       0%          0%
  • Shrinkage
  • Increases estimates for underestimated words
    (e.g., "metastasis")
  • Decreases word-probability estimates for
    overestimated words (e.g., "aids")
  • It also introduces (with small probabilities)
    spurious words (e.g., "football")

37
Is Shrinkage Always Necessary?
  • Shrinkage is used to reduce uncertainty (variance)
    of estimates
  • Small samples of large databases → high variance
  • In sample: 10 out of 100 documents contain
    "metastasis"
  • In database: ? out of 10,000,000 documents?
  • Small samples of small databases → small variance
  • In sample: 10 out of 100 documents contain
    "metastasis"
  • In database: ? out of 200 documents?
  • Shrinkage less useful (or even harmful) when
    uncertainty is low

38
Adaptive Application of Shrinkage
  • Database selection algorithms assign scores to
    databases for each query
  • When word frequency estimates are uncertain,
    assigned score has high variance
  • shrinkage improves score estimates
  • When word frequency estimates are reliable,
    assigned score has small variance
  • shrinkage unnecessary

[Plots: probability distributions of the database score for a
query. Unreliable score estimate (wide distribution): use
shrinkage. Reliable score estimate (narrow distribution):
shrinkage might hurt.]

Solution: Use shrinkage adaptively, in a query-
and database-specific manner
39
Searching Algorithm
  1. Classify databases and extract document samples
  2. Adjust frequencies in samples

One-time process
  • For each query
  • For each database D
  • Assign score to database D (using extracted
    content summary)
  • Examine uncertainty of score
  • If uncertainty is high, apply shrinkage and compute a
    new score; else keep the existing score
  • Query only top-K scoring databases

For every query
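A rough Python sketch of this per-query loop; the scoring function, the variance field, and the threshold are simplified stand-ins for the actual scoring and uncertainty test.

```python
def score(query, summary):
    # Placeholder scoring function: sum of per-word document frequencies.
    return sum(summary.get(w, 0) for w in query.split())

def select_databases(query, databases, top_k=2, var_threshold=0.5):
    """databases: dict name -> {'summary', 'shrunk', 'score_variance'}."""
    scored = []
    for name, db in databases.items():
        s = score(query, db["summary"])
        # If the score estimate is too uncertain, fall back to the
        # shrinkage-based summary for this query/database pair.
        if db["score_variance"] > var_threshold:
            s = score(query, db["shrunk"])
        scored.append((s, name))
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]

dbs = {
    "PubMed":  {"summary": {"thrombopenia": 24826}, "shrunk": {"thrombopenia": 24000}, "score_variance": 0.1},
    "USPTO":   {"summary": {}, "shrunk": {"thrombopenia": 3}, "score_variance": 0.9},
    "NYTimes": {"summary": {}, "shrunk": {"thrombopenia": 1}, "score_variance": 0.9},
}
print(select_databases("thrombopenia", dbs))   # ['PubMed', 'USPTO']
```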
40
Results: Database Selection
  • Metric: R(K) = X / Y
  • X: number of relevant documents in the selected K
    databases
  • Y: number of relevant documents in the best K
    databases

For CORI (a state-of-the-art database selection
algorithm) with stemming over TREC6 testbed
41
Outline of Talk
  • Classification of Hidden-Web Databases
  • Search over Hidden-Web Databases
  • Managing Changes in Hidden-Web Databases

42
Never-update Policy
  • Naive practice: construct summary once, never
    update
  • Extracted (old) summary may
  • Miss new words (from new documents)
  • Contain obsolete words (from deleted documents)
  • Provide inaccurate frequency estimates

NY Times (Oct 29, 2004)          NY Times (Mar 29, 2005)
Word       Docs                  Word       Docs
tsunami    (0)                   tsunami    250
recount    2,302                 recount    (0)
grokster   2                     grokster   78

43
Updating Content Summaries Questions
  • Do content summaries change over time?
  • Which database properties affect the rate of
    change?
  • How to schedule updates with constrained
    resources?

ICDE 2005
44
Data for our Study: 152 Web Databases
  • Study period: Oct 2002 - Oct 2003
  • 52 weekly snapshots for each database
  • 5 million pages in each snapshot (approx.)
  • 65 GB per snapshot (3.3 TB total)
  • Examined differences between snapshots
  • After 1 week: ~5% new words, ~5% of old words disappeared
  • After 20 weeks: ~20% new words, ~20% of old words disappeared

45
Survival Analysis
Survival Analysis: a collection of statistical
techniques for predicting the time until an
event occurs
  • Initially used to measure length of survival of
    patients under different treatments (hence the
    name)
  • Used to measure effect of different parameters
    (e.g., weight, race) on survival time
  • We want to predict time until next update and
    find database properties that affect this time

46
Survival Analysis for Summary Updates
  • Survival time of summary: time until the current
    database summary is sufficiently different from
    the old one (i.e., an update is required)
  • Old summary changes at time t if:
    KL divergence(current, old) > τ
    (τ: change-sensitivity threshold)
  • Survival analysis estimates the probability that a
    database summary changes within time t
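A brief sketch of this change test, assuming both summaries are word-probability dictionaries; scipy's entropy function computes the KL divergence, and the small smoothing constant (not part of the talk) avoids division by zero.

```python
from scipy.stats import entropy

def kl_divergence(current, old, eps=1e-6):
    """KL divergence between the current and old word distributions."""
    vocab = sorted(set(current) | set(old))
    p = [current.get(w, 0.0) + eps for w in vocab]   # current summary
    q = [old.get(w, 0.0) + eps for w in vocab]       # old (extracted) summary
    return entropy(p, q)                             # normalizes p and q internally

old     = {"recount": 0.4, "grokster": 0.1, "election": 0.5}
current = {"tsunami": 0.3, "grokster": 0.3, "election": 0.4}

tau = 0.5   # change-sensitivity threshold
print(kl_divergence(current, old) > tau)   # True -> summary needs an update
```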
47
Survival Times and Incomplete Data
[Timeline diagram: observed survival times for a database,
measured in weeks; some observations are censored]
  • Many observations are incomplete (aka
    censored)
  • Censored data give partial information (database
    did not change)

48
Using Censored Data
  • By ignoring censored cases we get underestimates →
    perform more update operations than needed
  • By using censored cases as-is we again get
    underestimates
  • Survival analysis extends the lifetime of
    censored cases

49
Database Properties and Survival Times
  • For our analysis, we use Cox Proportional Hazards
    Regression
  • Effectively uses censored data (i.e., database
    did not change within time T)
  • Derives effect of database properties on rate of
    change
  • E.g., if you double the size of a database, it
    changes twice as fast
  • No assumptions about the form of the survival
    function
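As a sketch of how such a regression could be set up, the lifelines library (assumed available) fits a Cox proportional-hazards model where censored observations get an event flag of 0; the columns and toy rows below are illustrative, not the study's data.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Toy data: one row per observed interval for a database.
# 'weeks' is the observed time; 'changed' is 0 for censored observations
# (the summary had not changed by the end of the observation window).
df = pd.DataFrame({
    "weeks":    [3, 8, 12, 20, 25, 30, 14, 40],
    "changed":  [1, 1,  0,  1,  1,  1,  1,  0],
    "log_size": [6.1, 5.2, 4.8, 6.5, 4.1, 6.8, 5.9, 4.5],  # log of database size
    "is_gov":   [0, 0, 1, 0, 1, 0, 0, 1],                   # domain indicator
})

cph = CoxPHFitter()
cph.fit(df, duration_col="weeks", event_col="changed")
cph.print_summary()   # coefficients show how size and domain affect the rate of change
```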

50
Baseline Survival Functions by Domain
  • Effect of domain
  • GOV changes slower than any other domain
  • EDU changes fast in the short term, but slower in
    the long term
  • COM and other commercial sites change faster than
    the rest

51
Results of Cox PH Analysis
  • Cox PH analysis gives a formula for predicting
    the time between updates for any database
  • Rate of change depends on
  • domain
  • database size
  • history of change
  • threshold τ

By knowing time between updates we can schedule
update operations better!
52
Scheduling Updates
Database         Rate of change λ   Avg. time between updates:   Avg. time between updates:
                                    10 weeks                     40 weeks
Tom's Hardware        0.088         5 weeks                      46 weeks
USPS                  0.023         12 weeks                     34 weeks

With plentiful resources, we update sites
according to their rate of change.
When resources are constrained, we update sites
that change too frequently less often.
53
Classification & Search: Overall Contributions
  • Support for browsing and searching Hidden-Web
    databases
  • No need for cooperation: works with autonomous
    Hidden-Web databases
  • Scalable: works with a large number of databases
  • Not restricted to Hidden-Web databases: works
    with any searchable text database

Classification and content summary extraction are
implemented and available for download at
http://sdarts.cs.columbia.edu
54
Current Work: Integrated Access to Hidden-Web
Databases
Query: "good drama movies playing in newark
tomorrow"
Current top Google result
(as of Oct 2nd, 2005)
55
Future Work: Integrated Access to Hidden-Web
Databases
Query: "good drama movies playing in newark tomorrow"
  → query review databases
  → query movie databases
  → query ticket databases
  • All information is already available on the Web
  • Review databases: Rotten Tomatoes, NY Times,
    TONY, ...
  • Movie databases: All Movie Guide, IMDB
  • Tickets: Moviefone, Fandango, ...

56
Future Work: Integrated Access to Hidden-Web
Databases
Query: "good drama movies playing in newark tomorrow"
  → query review databases
  → query movie databases
  → query ticket databases
  • Challenges
  • Short term
  • Learn to interface with different databases
  • Adapt database selection algorithms
  • Long term
  • Understand semantics of query
  • Extract query plans and optimize for
    distributed execution
  • Personalization
  • Security and privacy

57
Thank you!
58
Other Work: Approximate Text Matching
VLDB 2001
WWW 2003
Matching similar strings within a relational DBMS:
important data resides there
Service A
Jenny Stamatopoulou
John Paul McDougal
Aldridge Rodriguez
Panos Ipeirotis
John Smith
Service B
Panos Ipirotis
Jonh Smith
Stamatopulou, Jenny
John P. McDougal
Al Dridge Rodriguez
Exact joins are not enough: typing mistakes,
abbreviations, different conventions
  • Introduced algorithms for mapping approximate
    text joins into SQL
  • No need for import/export of data
  • Provides crucial building block for data cleaning
    applications
  • Identifies many interesting matches

Joint work with Divesh Srivastava, Nick Koudas
(AT&T Labs-Research) and others
59
No Good Category for Database
  • General problem with supervised learning
  • Example: English vs. Chinese databases
  • Devised technique to analyze whether we can work
    with a given database
  • Find candidate text fields
  • Send top-level queries
  • Examine results; construct similarity matrix
  • If matrix rank is small → many similar pages
    returned:
  • Web form is not a search interface
  • Text field is not a keyword field
  • Database is of a different language
  • Database is of an unknown topic

60
Database not Category Focused
  • Extract one content summary per topic
  • Focused queries retrieve documents about a known
    topic
  • Each database is represented multiple times in
    hierarchy

61
Near Future Work: Definition and Analysis of
Query-based Algorithms
  • Currently, query-based algorithms are evaluated
    only empirically
  • Possible to model the querying process using random
    graph theory and:
  • Analyze thoroughly the properties of the algorithms
  • Understand better why, when, and how the
    algorithms work
  • Interested in exploring similar directions
  • Adapt hyperlink-based ranking algorithms
  • Use results from graph theory to design sampling
    algorithms

WebDB 2003
62
Crawling- vs. Query-based Classification for CNN Sports
Efficiency Statistics

                  Crawling-based      Query-based
  Time            1,325 min           2 min (-99.8%)
  Files/Queries   270,202 files       112 queries
  Size            8 GB                357 KB (-99.9%)

IEEE DEB March 2002
Accuracy Statistics
Crawling-based classification classifies CNN Sports
correctly only after downloading 70% of its documents
63
Real Confusion Matrix for Top Node of Hierarchy

             Health   Sports   Science   Computers   Arts
  Health     0.753    0.018    0.124     0.021       0.017
  Sports     0.006    0.171    0.021     0.016       0.064
  Science    0.016    0.024    0.255     0.047       0.018
  Computers  0.004    0.042    0.080     0.610       0.031
  Arts       0.004    0.024    0.027     0.031       0.298
64
Adjusting Document Frequencies
  • Zipf's law empirically connects word frequency f
    and rank r
  • We know the document frequency and rank r of the
    words in the sample

f = A · (r + B)^c

[Plot: frequency vs. rank in the sample (ranks 1, 12, 78, ...)]

VLDB 2002
65
Adjusting Document Frequencies
  • Zipf's law empirically connects word frequency f
    and rank r
  • We know the document frequency and rank r of the
    words in the sample
  • We know the real document frequency f of some words
    from one-word queries

f = A · (r + B)^c

[Plot: real frequency in the database vs. rank in the sample
for the words matched by one-word queries]

VLDB 2002
66
Adjusting Document Frequencies
  • Zipf's law empirically connects word frequency f
    and rank r
  • We know the document frequency and rank r of the
    words in the sample
  • We know the real document frequency f of some words
    from one-word queries
  • We use curve-fitting to estimate the absolute
    frequency of all words in the sample

f = A · (r + B)^c

[Plot: estimated frequency in the database vs. rank in the
sample, after curve-fitting]

VLDB 2002
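A sketch of this curve-fitting step with scipy: the (rank, real frequency) pairs known from one-word queries are fit to f = A·(r + B)^c, and the fitted curve then estimates absolute frequencies for the remaining sampled words; the sample values below are made up for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def zipf_mandelbrot(r, A, B, c):
    # f = A * (r + B) ** c   (c is expected to be negative)
    return A * (r + B) ** c

# Ranks (in the sample) and real document frequencies known from one-word queries.
known_ranks = np.array([1, 12, 78, 500])
known_freqs = np.array([1_500_000, 120_000, 18_000, 2_500])

params, _ = curve_fit(zipf_mandelbrot, known_ranks, known_freqs,
                      p0=(1e6, 1.0, -1.0),
                      bounds=([0, 0, -5], [np.inf, np.inf, 0]))
A, B, c = params

# Estimate the absolute database frequency of any other sampled word from its rank.
print(int(zipf_mandelbrot(40, A, B, c)))
```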
67
Measuring Changes over Time
  • Recall: How many words in the current summary are
    also in the old (extracted) summary?
  • Shows how well old summaries cover the current
    (unknown) vocabulary
  • Higher values are better
  • Precision: How many words in the old (extracted)
    summary are still in the current summary?
  • Shows how many obsolete words exist in the old
    summaries
  • Higher values are better

Results for complete summaries (similar for
approximate summaries)
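A tiny sketch of these two measures over word sets (my phrasing of the definitions above, not code from the study):

```python
def summary_recall(current_words, old_words):
    # Fraction of the current vocabulary that the old summary still covers.
    return len(current_words & old_words) / len(current_words)

def summary_precision(current_words, old_words):
    # Fraction of the old summary's words still present in the current vocabulary.
    return len(current_words & old_words) / len(old_words)

old = {"recount", "grokster", "election"}
current = {"tsunami", "grokster", "election"}
print(summary_recall(current, old), summary_precision(current, old))  # 0.666..., 0.666...
```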
68
Modeling Goals
  • Goal: Estimate the database-specific survival-time
    distribution
  • Exponential distribution S(t) = exp(-λt) is common
    for survival times
  • λ captures the rate of change
  • Need to estimate λ for each database
  • Preferably, infer λ from database properties
    (with no training)
  • Intuitive (and wrong) approach: multiple regression
  • Study contains a large number of incomplete
    observations
  • Target variable S(t) is typically not Gaussian
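For illustration, once λ is known the exponential model directly gives the survival probability and a natural update time; the λ value below is the rate of change listed earlier for Tom's Hardware.

```python
import math

lam = 0.088               # rate of change (value reported earlier for Tom's Hardware)

def survives(t):
    return math.exp(-lam * t)        # S(t): probability the summary is still current at t

print(survives(10))                  # probability of surviving 10 weeks
# Time until the probability of a change reaches 50% (i.e., S(t) = 0.5):
print(math.log(2) / lam)             # ~7.9 weeks
```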

69
Other Experiments
  • Effect of choice of document classifiers
  • RIPPER
  • C4.5
  • Naïve Bayes
  • SVM
  • Benefits of feature selection
  • Effect of search-interface heterogeneity:
    Boolean vs. vector-space retrieval models
  • Effect of the query-overlap elimination step
  • Over crawlable databases, query-based
    classification is orders of magnitude faster than
    brute-force crawling-based classification

ACM TOIS 2003
IEEE Data Engineering Bulletin 2002