Title: Classifying and Searching "Hidden-Web" Text Databases
1Classifying and Searching "Hidden-Web" Text
Databases
Computer Science Department Columbia University
2Motivation: Surface Web vs. Hidden Web
- Surface Web
- Link structure
- Crawlable
- Documents indexed by search engines
- Hidden Web
- No link structure
- Documents hidden in databases
- Documents not indexed by search engines
- Need to query each collection individually
3Hidden-Web Databases: Examples
Search on the U.S. Patent and Trademark Office (USPTO) database for "wireless network": 25,749 matches (the USPTO database is at http://patft.uspto.gov/netahtml/search-bool.html)
Search on Google restricted to the USPTO database site, "wireless network site:patft.uspto.gov": 0 matches
Database | Query | Database Matches | Site-Restricted Google Matches
USPTO | wireless network | 25,749 | 0
Library of Congress | visa regulations | >10,000 | 0
PubMed | thrombopenia | 26,460 | 172
(as of Feb 10th, 2004)
4Interacting With Hidden-Web Databases
- Browsing: Yahoo!-like directories (populated manually)
- InvisibleWeb.com
- SearchEngineGuide.com
- Searching: Metasearchers
5Outline of Talk
- Classification of Hidden-Web Databases
- Search over Hidden-Web Databases
- SDARTS
6Hierarchically Classifying the ACM Digital Library
[Figure: where in a topic hierarchy should the ACM DL be placed?]
7Text Database Classification Definition
- For a text database D and a category C
- Coverage(D,C) = number of docs in D about C
- Specificity(D,C) = fraction of docs in D about C
- Assign a text database to a category C if
- Database coverage for C at least Tc
- Tc: coverage threshold (e.g., > 100 docs in C)
- Database specificity for C at least Ts
- Ts: specificity threshold (e.g., > 40% of docs in C)
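A minimal sketch of the definition above, assuming the per-category document counts of a database are already known. The function name, the toy counts, and the default thresholds (Tc = 100, Ts = 0.4, taken from the examples on the slide) are illustrative assumptions, not part of the original talk.

```python
def classify_database(topic_counts, tc=100, ts=0.4):
    """Return the categories whose coverage and specificity meet the thresholds.

    topic_counts maps category -> number of docs in the database about it
    (that number is Coverage(D, C)); specificity is the fraction of all docs.
    """
    total = sum(topic_counts.values())
    assigned = []
    for category, coverage in topic_counts.items():
        specificity = coverage / total if total else 0.0
        if coverage >= tc and specificity >= ts:
            assigned.append(category)
    return assigned


# Hypothetical topic distribution for one database.
counts = {"Health": 4500, "Sports": 50, "Computers": 450}
categories = classify_database(counts)
```

With these toy counts only Health passes both tests: Sports fails the coverage threshold, and Computers has enough documents but too small a share of the database.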
8Brute-Force Classification Strategy
- Extract all documents from database
- Classify documents on topic
- (use state-of-the-art classifiers: SVMs, C4.5, RIPPER, ...)
- Classify database according to topic distribution
Problem: No direct access to full contents of Hidden-Web databases
9Classification: Goal and Challenges
- Goal
- Discover database topic distribution
- Challenges
- No direct access to full contents of Hidden-Web databases
- Only limited search interfaces available
- Should not overload databases
Key observation: Only queries about database topic(s) generate a large number of matches
10Query-based Database Classification Overview
- Train document classifier
- Extract queries from classifier
- Adaptively issue queries to database
- Identify topic distribution based on adjusted number of query matches
- Classify database
[Figure: pipeline TRAIN CLASSIFIER -> EXTRACT QUERIES (e.g., Sports: nba knicks; Health: sars) -> QUERY DATABASE (e.g., sars: 1254 matches) -> IDENTIFY TOPIC DISTRIBUTION -> CLASSIFY DATABASE]
11Training a Document Classifier
- Get training set (set of pre-classified documents)
- Select best features to characterize documents
- (Zipf's law and information-theoretic feature selection) [Koller and Sahami 1996]
- Train classifier (SVM, C4.5, RIPPER, ...)
Output: A "black-box" model for classifying documents
12Extracting Query Probes
- Transform classifier model into queries
- Trivial for rule-based classifiers (RIPPER)
Example query for Sports: nba knicks
ACM TOIS 2003
13Querying Database with Extracted Queries
- Issue each query to database to obtain number of matches, without retrieving any documents
- Increase coverage of rule's category accordingly (Sports = Sports + 706)
SIGMOD 2001
ACM TOIS 2003
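The probing step can be sketched as follows. This is only an illustration under stated assumptions: `PROBES` is a hand-written stand-in for queries extracted from a rule-based classifier, and `count_matches` simulates a search interface over an in-memory list of documents; a real Hidden-Web database would report the match count through its own search form.

```python
from collections import defaultdict

# Hypothetical probes: each is a conjunction of words tied to one category.
PROBES = [
    ("Sports", ["nba", "knicks"]),
    ("Health", ["sars"]),
]


def count_matches(documents, words):
    """Stand-in for a search interface that reports only the number of matches."""
    return sum(all(w in doc.lower().split() for w in words) for doc in documents)


def probe_database(documents, probes):
    """Add each probe's match count to the coverage of the probe's category."""
    coverage = defaultdict(int)
    for category, words in probes:
        coverage[category] += count_matches(documents, words)
    return dict(coverage)


toy_db = [
    "Knicks win NBA opener",
    "SARS outbreak contained",
    "New SARS vaccine trial",
    "NBA draft results",
]
estimated = probe_database(toy_db, PROBES)
```

No document text is returned to the caller, only counts, which mirrors the "without retrieving any documents" point on the slide.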
14Identifying Topic Distribution from Query Results
Query-based estimates of topic distribution are not perfect
- Document classifiers not perfect
- Rules for one category match documents from other categories
- Querying not perfect
- Queries for same category might overlap
- Queries do not match all documents in a category
Solution: Learn to adjust results of query probes
15Confusion Matrix Adjustment of Query Probe
Results
Confusion matrix M (columns: correct class; rows: assigned class)
         comp   sports  health
comp     0.80   0.10    0.00
sports   0.08   0.85    0.04
health   0.02   0.15    0.96
Real Coverage (correct but unknown topic distribution): comp 1000, sports 5000, health 50
Estimated Coverage (incorrect topic distribution derived from query probing) = M x Real Coverage:
comp:   800 + 500 + 0 = 1300
sports: 80 + 4250 + 2 = 4332
health: 20 + 750 + 48 = 818
Example entry: 10% of sports documents match queries for computers
This multiplication can be inverted to get a better estimate of the real topic distribution from the probe results
16Confusion Matrix Adjustment of Query Probe
Results
$\mathrm{Coverage}(D) \approx M^{-1} \cdot \mathrm{ECoverage}(D)$
(Coverage(D): adjusted estimate of topic distribution; ECoverage(D): probing results)
- M usually diagonally dominant for "reasonable" document classifiers, hence invertible
- Compensates for errors in query-based estimates of topic distribution
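As a worked check of the adjustment, the sketch below solves M x = ECoverage for the three-category example from the previous slide (confusion matrix entries and estimated coverage 1300, 4332, 818). The solver is a plain Gauss-Jordan elimination written for this illustration; it is an assumption about how one might invert M, not the implementation used in the talk.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("confusion matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Rows: assigned class; columns: correct class (comp, sports, health).
M = [[0.80, 0.10, 0.00],
     [0.08, 0.85, 0.04],
     [0.02, 0.15, 0.96]]
estimated_coverage = [1300, 4332, 818]   # probing results from the slide
adjusted_coverage = solve_linear_system(M, estimated_coverage)
```

Up to floating-point error, the adjusted coverage recovers the real coverage of 1000, 5000, and 50 documents shown on the slide.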
17Classification Algorithm (Again)
One-time process:
- Train document classifier
- Extract queries from classifier
For every database:
- Adaptively issue queries to database
- Identify topic distribution based on adjusted number of query matches
- Classify database
18Experimental Setup
- 72-node 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes)
- 500,000 Usenet articles (April-May 2000)
- Newsgroups assigned by hand to hierarchy nodes
- RIPPER trained with 54,000 articles (1,000 articles per leaf), 27,000 articles to construct confusion matrix
- 500 "Controlled" databases built using 419,000 newsgroup articles (to run detailed experiments)
- 130 real Web databases picked from InvisibleWeb (first 5 under each topic)
comp.hardware
rec.music.classical
rec.photo.
19Experimental Results: Controlled Databases
- Accuracy (using F-measure)
- Above 80% for most <Tc, Ts> threshold combinations tried
- Degrades gracefully with hierarchy depth
- Confusion-matrix adjustment helps
- Efficiency
- Relatively small number of queries (<500) needed for most <Tc, Ts> threshold combinations tried
20Experimental Results: Web Databases
- Accuracy (using F-measure)
- 70% for best <Tc, Ts> combination
- Learned thresholds that reproduce human classification
- Tested threshold choice using 3-fold cross validation
- Efficiency
- 120 queries per database on average needed for choice of thresholds, no documents retrieved
- Only small part of hierarchy explored
- Queries are short: 1.5 words on average, 4 words maximum (easily handled by most Web databases)
21Other Experiments
- Effect of choice of document classifiers
- RIPPER
- C4.5
- Naïve Bayes
- SVM
- Benefits of feature selection
- Effect of search-interface heterogeneity: Boolean vs. vector-space retrieval models
- Effect of query-overlap elimination step
- Over crawlable databases: query-based classification orders of magnitude faster than "brute-force" crawling-based classification
ACM TOIS 2003
IEEE Data Engineering Bulletin 2002
22Hidden-Web Database Classification Summary
- Handles autonomous Hidden-Web databases accurately and efficiently
- 70% F-measure
- Only 120 queries issued on average, with no documents retrieved
- Handles large family of document classifiers (and can hence exploit future advances in machine learning)
23Outline of Talk
- Classification of Hidden-Web Databases
- Search over Hidden-Web Databases
- SDARTS
24Interacting With Hidden-Web Databases
- Browsing: Yahoo!-like directories
- Searching: Metasearchers
Content not accessible through Google
[Figure: Query -> Metasearcher -> NYTimesArchives, PubMed, USPTO, Library of Congress]
25Metasearchers Provide Access to Distributed Databases
Database selection relies on simple content summaries: vocabulary, word frequencies
Example content summary: PubMed (11,868,552 documents): aids 121,491; cancer 1,562,477; heart 691,360; hepatitis 121,129; thrombopenia 24,826
[Figure: the Metasearcher decides where to send the query "thrombopenia" using each database's summary: PubMed (thrombopenia 24,826), NYTimesArchives (thrombopenia 18), USPTO (thrombopenia 0)]
26Extracting Content Summaries from Autonomous Hidden-Web Databases
[Callan & Connell 2001]
1. Send random queries to databases
2. Retrieve top matching documents
3. If retrieved 300 documents then stop, else go to Step 1
Content summary contains words in sample and document frequency of each word
- Problems
- Random sampling retrieves non-representative documents
- Frequencies in summary "compressed" to sample size range
- Summaries from small samples are highly incomplete
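A rough sketch of the sampling loop above, under several assumptions: `search` is a hypothetical callable that returns `(doc_id, text)` pairs for a one-word query, later query words are drawn from the documents sampled so far, and `max_queries` is a safety cap added for this example. The toy database and its contents are made up.

```python
import random
from collections import Counter


def sample_content_summary(search, seed_words, target_docs=300, top_k=4,
                           max_queries=1000, rng=None):
    """Query-based sampling: returns the document sample and word document frequencies."""
    rng = rng or random.Random(0)
    vocabulary = list(seed_words)
    sample = {}
    for _ in range(max_queries):
        if len(sample) >= target_docs:
            break
        query = rng.choice(vocabulary)              # Step 1: send a random query
        for doc_id, text in search(query)[:top_k]:  # Step 2: keep top matches
            if len(sample) >= target_docs:
                break
            if doc_id not in sample:
                sample[doc_id] = text
                vocabulary.extend(text.lower().split())
    df = Counter()
    for text in sample.values():
        df.update(set(text.lower().split()))        # document frequency in sample
    return sample, df


TOY_DOCS = {1: "cancer heart", 2: "cancer therapy", 3: "heart surgery",
            4: "therapy trial", 5: "basketball game"}


def toy_search(word):
    return [(i, t) for i, t in sorted(TOY_DOCS.items()) if word in t.split()]


sample, df = sample_content_summary(toy_search, ["cancer"], target_docs=3)
```

In this toy run the "basketball game" document can never be reached from the seed word, a small-scale version of the non-representative sampling problem listed on the slide.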
27Extracting Representative Document Sample
Problem 1: Random sampling retrieves non-representative documents
- Train a document classifier
- Create queries from classifier
- Adaptively issue queries to databases
- Retrieve top-k matching documents for each query
- Save number of matches for each one-word query
- Identify topic distribution based on adjusted number of query matches
- Categorize the database
- Generate content summary from document sample
Sampling retrieves documents only from "topically dense" areas of the database
28Sample Frequencies vs. Actual Frequencies
Problem 2: Frequencies in summary "compressed" to sample size range
PubMed (11,868,552 docs): cancer 1,562,477; heart 691,360
After sampling, PubMed sample (300 documents): cancer 45; heart 16
Key Observation: Query matches reveal frequency information
29Adjusting Document Frequencies
- Zipf's law empirically connects word frequency f and rank r:
$f = A (r + B)^{c}$
[Figure: frequency vs. rank]
VLDB 2002
30Adjusting Document Frequencies
- Zipf's law empirically connects word frequency f and rank r:
$f = A (r + B)^{c}$
- We know document frequency and rank r of the words in sample
[Figure: frequency in sample vs. rank in sample, with ranks 1, 12, 78, ... marked]
VLDB 2002
31Adjusting Document Frequencies
- Zipf's law empirically connects word frequency f and rank r:
$f = A (r + B)^{c}$
- We know document frequency and rank r of the words in sample
- We know real document frequency f of some words from one-word queries
[Figure: frequency in database vs. rank in sample, with ranks 1, 12, 78, ... marked]
VLDB 2002
32Adjusting Document Frequencies
- Zipf's law empirically connects word frequency f and rank r:
$f = A (r + B)^{c}$
- We know document frequency and rank r of the words in sample
- We know real document frequency f of some words from one-word queries
- We use curve-fitting to estimate the absolute frequency of all words in sample
[Figure: estimated frequency in database vs. rank, with ranks 1, 12, 78, ... marked]
VLDB 2002
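One way to carry out the curve-fitting step, sketched under assumptions: the fit is done by grid search over B combined with least squares on log f = log A + c log(r + B). The fitting procedure, the grid, and the synthetic "known" points are choices made for this example; the slides do not say which fitting method was used.

```python
import math


def fit_power_law(points, b_grid=None):
    """Fit f = A * (r + B) ** c to (rank, frequency) points; returns (A, B, c)."""
    if b_grid is None:
        b_grid = [i * 0.5 for i in range(41)]      # candidate B values 0.0 .. 20.0
    best = None
    n = len(points)
    ys = [math.log(f) for _, f in points]
    my = sum(ys) / n
    for b in b_grid:
        xs = [math.log(r + b) for r, _ in points]
        mx = sum(xs) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        if sxx == 0:
            continue
        c = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        log_a = my - c * mx
        err = sum((log_a + c * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, math.exp(log_a), b, c)
    _, a, b, c = best
    return a, b, c


def estimate_frequency(rank, a, b, c):
    """Estimated database frequency of the word at the given sample rank."""
    return a * (rank + b) ** c


# Synthetic "one-word query" results at sample ranks 1, 12, 78, 200.
TRUE_A, TRUE_B, TRUE_C = 1.0e6, 2.0, -1.1
known = [(r, TRUE_A * (r + TRUE_B) ** TRUE_C) for r in (1, 12, 78, 200)]
a, b, c = fit_power_law(known)
estimate = estimate_frequency(40, a, b, c)
```

Words whose real frequency is known from one-word queries supply the `known` points; every other sampled word then gets an estimate from its rank alone.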
33Actual PubMed Content Summary
PubMed content summary
Number of Documents: 8,691,360 (Actual: 11,868,552)
Category: Health, Diseases
cancer 1,562,477
heart 581,506 (Actual: 691,360)
aids 121,491
hepatitis 73,481 (Actual: 121,129)
basketball 907 (Actual: 1,063)
cpu 598
- Extracted automatically
- 27,500 words in extracted content summary
- Fewer than 200 queries sent
- At most 4 documents retrieved per query
(heart, hepatitis, basketball not in 1-word
probes)
34Sampling and Incomplete Content Summaries
Problem 3: Summaries from small samples are highly incomplete
[Figure: log(frequency) vs. frequency rank for the 10% most frequent words in the PubMed database, with sample size 300 marked; e.g., "aphasia" appears in about 9,000 docs (about 0.1%)]
- Many words appear in "relatively" few documents (Zipf's law)
- Low-frequency words are often important
- Small document samples miss many low-frequency words
35Sample-based Content Summaries
Challenge: Improve content summary quality without increasing sample size
- Main Idea: Database Classification Helps
- Similar topics ↔ Similar content summaries
- Extracted content summaries complement each other
36Databases with Similar Topics
- CANCERLIT contains "metastasis", not found during sampling
- CancerBACUP contains "metastasis"
- Databases under same category have similar vocabularies, and can complement each other
37Content Summaries for Categories
- Databases under same category share similar vocabulary
- Higher level category content summaries provide additional useful estimates
- All estimates in category path are potentially useful
38Enhancing Summaries Using Shrinkage
- Estimates from database content summaries can be unreliable
- Category content summaries are more reliable (based on larger samples) but less specific to database
- By combining estimates from category and database content summaries we get better estimates
SIGMOD 2004
39Shrinkage-based Estimations
Adjust estimate for "metastasis" in D: $\lambda_1 \cdot 0.002 + \lambda_2 \cdot 0.05 + \lambda_3 \cdot 0.092 + \lambda_4 \cdot 0.000$
Select $\lambda_i$ weights to maximize the probability that the summary of D is from a database under all its parent categories
Avoids "sparse data" problem and decreases estimation risk
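The combination itself is a weighted sum. The sketch below uses the four estimates shown on the slide with made-up weights; the weight values are purely hypothetical (the talk selects them by maximizing a likelihood, which is not reproduced here), and which estimate belongs to which level of the category path is not recoverable from the slide text.

```python
def shrink(estimates, weights):
    """Mix per-level estimates for one word with weights that sum to 1."""
    if len(estimates) != len(weights):
        raise ValueError("need one weight per estimate")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * p for w, p in zip(weights, estimates))


estimates = [0.002, 0.05, 0.092, 0.000]   # values from the slide
weights = [0.1, 0.2, 0.3, 0.4]            # hypothetical lambda values
shrunk = shrink(estimates, weights)
```

Even though one of the estimates is 0.000, the mixed estimate is non-zero, which is how shrinkage sidesteps the sparse-data problem.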
40Adaptive Application of Shrinkage
- Database selection algorithms assign scores to databases for each query
- When frequency estimates are uncertain, assigned score is uncertain...
- ...but sometimes confidence about assigned score is high
- When confident about score, shrinkage unnecessary
[Figure: two probability distributions over the database score for a query (0 to 1). Unreliable score estimate: use shrinkage. Reliable score estimate: shrinkage might hurt.]
41Extracting Content Summaries: Problems Solved
- Problem 1: Random sampling may retrieve non-representative documents
- Solution: Focus querying on "topically dense" areas of the database
- Problem 2: Frequencies are "compressed" to the sample size range
- Solution: Exploit number of matches for query and adjust estimates using curve fitting
- Problem 3: Summaries based on small samples are highly incomplete
- Solution: Exploit database classification and augment summaries using samples from topically similar databases
42Searching Algorithm
One-time process:
- Classify databases and extract document samples
- Adjust frequencies in samples
For every query:
- For each database D
- Assign score to database D (using extracted content summary)
- Examine uncertainty of score
- If uncertainty high, apply shrinkage and give new score; else keep existing score
- Query only top-K scoring databases
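The per-query loop can be sketched as below. Everything concrete here is an assumption made for illustration: the score is a bGlOSS-style estimate of the number of documents matching all query words (assuming word independence), the uncertainty test is a simple stand-in (a query word missing from the sample-based summary), and all database sizes and frequencies are toy numbers.

```python
def score(query_words, summary):
    """bGlOSS-style score: expected number of docs containing all query words."""
    n = summary["num_docs"]
    s = float(n)
    for w in query_words:
        s *= summary["df"].get(w, 0) / n
    return s


def select_databases(query_words, summaries, shrunk_summaries, k):
    """Score every database, using the shrunk summary only when the score is uncertain."""
    ranked = []
    for name, summary in summaries.items():
        # Hypothetical uncertainty test, not the one used in the talk.
        uncertain = any(w not in summary["df"] for w in query_words)
        chosen = shrunk_summaries[name] if uncertain else summary
        ranked.append((score(query_words, chosen), name))
    ranked.sort(reverse=True)
    return [name for _, name in ranked[:k]]


summaries = {
    "CANCERLIT": {"num_docs": 1000, "df": {"cancer": 800}},
    "CancerBACUP": {"num_docs": 500, "df": {"cancer": 300, "metastasis": 20}},
    "SportsDB": {"num_docs": 2000, "df": {"cancer": 10, "metastasis": 0, "nba": 1500}},
}
shrunk_summaries = dict(summaries)
shrunk_summaries["CANCERLIT"] = {"num_docs": 1000,
                                 "df": {"cancer": 800, "metastasis": 50}}
top = select_databases(["cancer", "metastasis"], summaries, shrunk_summaries, k=2)
```

Without shrinkage, CANCERLIT would score zero for this query because "metastasis" was missed during sampling; with the shrunk summary it ranks first.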
43Experimental Setup
- Two standard testbeds from TREC (Text Retrieval Conference)
- 200 databases
- 100 queries with associated human-assigned document relevance judgments
- Two sets of experiments
- Content summary quality
- Metrics: precision, recall, Spearman correlation coefficient, KL-divergence
- Database selection accuracy
- Metric: fraction of relevant documents for queries in top-scored databases
SIGMOD 2004
44Experimental Results
- Content summary quality
- Shrinkage improves quality of content summaries without increasing sample size
- Frequency estimation gives accurate (within 20%) estimates of actual frequencies
- Database selection accuracy
- Frequency estimation: Improves performance by 20%-30%
- Focused sampling: Improves performance by 40%-50%
- Adaptive application of shrinkage: Improves performance up to 100%
- Shrinkage is robust: Improved performance consistently across many different configurations
45Other Experiments
- Additional data set 315 real Web databases
- Choice of database selection algorithm (CORI, bGlOSS, Language Modeling)
- Effect of stemming
- Effect of stop-word elimination
SIGMOD 2004
46Classification & Search: Overall Contributions
- Support for browsing and searching Hidden-Web databases
- No need for cooperation: Work with autonomous Hidden-Web databases
- Scalable and work with large number of databases
- Not restricted to Hidden-Web databases: Work with any searchable text database
Classification and content summary extraction implemented and available for download at http://sdarts.cs.columbia.edu
47Outline of Talk
- Classification of Hidden-Web Databases
- Search over Hidden-Web Databases
- SDARTS: Protocol and Toolkit for Metasearching
48SDARTS: Protocol and Toolkit for Metasearching
[Figure: a Query goes through SDARTS to Web sources (Harrison's Online, British Medical Journal, PubMed) and Local sources (unstructured text documents, DLI2 Corpus XML documents)]
49SDARTS: Protocol and Toolkit for Metasearching
- Accomplishments
- Combines the strength of existing Digital Library protocols (SDLIP, STARTS)
- Enables indexing and wrapping of local collections of text and XML documents
- Enables "declarative" wrapping of Hidden-Web databases, with no programming
- Extracts content summary, topical focus, and technical level of each database
- Interfaces with Open Archives Initiative, an emerging Digital Library interoperability protocol
- Critical building block for search component of Columbia's PERSIVAL project (5-year, $5M NSF Digital Libraries Phase 2 project)
- Open source, available at http://sdarts.cs.columbia.edu
- 1,000 downloads since Jan 2003
- Supervised and coordinated eight students during development
ACM/IEEE JCDL Conference 2001, 2002
50Current Work: Updating Content Summaries
Databases are not static. Their content changes. When should we refresh the content summary?
- Examined 150 real Web databases over 52 weeks
- Modeled changes using "survival analysis" techniques (Cox proportional hazards model)
- Currently developing updating algorithms
- Contact database only when necessary
- Improve quality of summaries by exploiting history
Joint work with Junghoo Cho and Alex Ntoulas (UCLA)
51Other Work: Approximate Text Matching
VLDB'01
WWW'03
Matching similar strings within relational DBMS is important: data resides there
Service A
Jenny Stamatopoulou
John Paul McDougal
Aldridge Rodriguez
Panos Ipeirotis
John Smith
Service B
Panos Ipirotis
Jonh Smith
Stamatopulou, Jenny
John P. McDougal
Al Dridge Rodriguez
Exact joins not enough: Typing mistakes, abbreviations, different conventions
- Introduced algorithms for mapping approximate text joins into SQL
- No need for import/export of data
- Provides crucial building block for data cleaning applications
- Identifies many interesting matches
Joint work with Divesh Srivastava, Nick Koudas (AT&T Labs-Research) and others
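To illustrate why approximate matching catches pairs that an exact join misses, here is a small sketch using q-gram overlap (one of the cited papers mentions q-grams in its title). It is written in Python for illustration only and is not the SQL formulation from the talk; the Jaccard measure, the padding characters, and the 0.4 threshold are assumptions chosen for this example.

```python
def qgrams(text, q=3):
    """Set of overlapping q-character substrings of the padded, lowercased text."""
    padded = "#" * (q - 1) + text.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def similarity(a, b, q=3):
    """Jaccard similarity between the q-gram sets of two strings."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)


def approximate_join(left, right, threshold=0.4):
    """Nested-loop join that keeps pairs whose q-gram similarity meets the threshold."""
    return [(a, b) for a in left for b in right if similarity(a, b) >= threshold]


service_a = ["Panos Ipeirotis", "John Smith"]
service_b = ["Panos Ipirotis", "Jonh Smith"]
matches = approximate_join(service_a, service_b)
```

An exact join on these two lists returns nothing, while the q-gram join pairs each name with its misspelled counterpart and rejects the cross pairs.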
52Future Work: Integrated Access to Hidden-Web Databases
Query: good drama movies playing in west lafayette tomorrow
Current top Google result (as of April 1st, 2004): Story at "Old Gold Free Press" about Purdue basketball games in Fall 2002
53Future Work: Integrated Access to Hidden-Web Databases
Query: good drama movies playing in west lafayette tomorrow
- query review databases
- query movie databases
- query ticket databases
- All information already available on the web
- Review databases: Rotten Tomatoes, NY Times, TONY, ...
- Movie databases: All Movie Guide, IMDB
- Tickets: Moviefone, Fandango, ...
54Future Work: Integrated Access to Hidden-Web Databases
Query: good drama movies playing in west lafayette tomorrow
- query review databases
- query movie databases
- query ticket databases
- Challenges
- Short term
- Learn to interface with different databases
- Adapt database selection algorithms
- Long term
- Understand semantics of query
- Extract "query plans" and optimize for distributed execution
- Personalization
- Security and privacy
55Panos Ipeirotis http://www.cs.columbia.edu/~pirot
- Classification and Search of Hidden-Web Databases
- P. Ipeirotis, L. Gravano: When one Sample is not Enough: Improving Text Database Selection using Shrinkage. SIGMOD 2004
- L. Gravano, P. Ipeirotis, M. Sahami: QProber: A System for Automatic Classification of Hidden-Web Databases. ACM TOIS 2003
- E. Agichtein, P. Ipeirotis, L. Gravano: Modelling Query-Based Access to Text Databases. WebDB 2003
- P. Ipeirotis, L. Gravano: Distributed Search over the Hidden-Web: Hierarchical Database Sampling and Selection. VLDB 2002
- L. Gravano, P. Ipeirotis, M. Sahami: Query- vs. Crawling-based Classification of Searchable Web Databases. DEB 2002
- P. Ipeirotis, L. Gravano, M. Sahami: Probe, Count, and Classify: Categorizing Hidden-Web Databases. SIGMOD 2001
- Approximate Text Matching
- L. Gravano, P. Ipeirotis, N. Koudas, D. Srivastava: Text Joins in an RDBMS for Web Data Integration. WWW 2003
- L. Gravano, P. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava: Approximate String Joins in a Database (Almost) for Free. VLDB 2001
- L. Gravano, P. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava, L. Pietarinen: Using q-grams in a DBMS for Approximate String Processing. DEB 2001
- SDARTS: Protocol & Toolkit for Metasearching
- N. Green, P. Ipeirotis, L. Gravano: SDLIP + STARTS = SDARTS. A Protocol and Toolkit for Metasearching. JCDL 2001
- P. Ipeirotis, T. Barry, L. Gravano: Extending SDARTS: Extracting Metadata from Web Databases and Interfacing with the Open Archives Initiative. JCDL 2002
56Thank you!
57No Good Category for Database
- General "problem" with supervised learning
- Example: English vs. Chinese databases
- Devised technique to analyze if can work with given database
- Find candidate textfields
- Send top-level queries
- Examine results & construct similarity matrix
- If "matrix rank" small => Many "similar" pages returned
- Web form is not a search interface
- Textfield is not a "keyword" field
- Database is of different language
- Database is of an "unknown" topic
58Database not Category Focused
- Extract one content summary per topic
- Focused queries retrieve documents about known topic
- Each database is represented multiple times in hierarchy
59Near Future Work: Definition and analysis of query-based algorithms
- Currently query-based algorithms are evaluated only empirically
- Possible to model querying process using random graph theory and:
- Analyze thoroughly properties of the algorithms
- Understand better why, when, and how the algorithms work
- Interested in exploring similar directions
- Adapt hyperlink-based ranking algorithms
- Use results in graph theory to design sampling algorithms
WebDB 2003
60Database Selection (CORI, TREC6)
More results in Stemming/No Stemming,
CORI/LM/bGlOSS, QBS/FPS/RS/CMPL, Stopwords
61 3-Fold Cross-Validation
62Crawling- vs. Query-based Classification for CNN Sports
Efficiency Statistics
Approach | Time | Files / Queries | Size
Crawling-based | 1325min | 270,202 files | 8Gb
Query-based | 2min (-99.8%) | 112 queries | 357Kb (-99.9%)
IEEE DEB March 2002
Accuracy Statistics
With crawling-based classification, CNN-Sports is classified correctly only after downloading 70% of its documents
63Experiments: Precision of Database Selection Algorithms
Content Summary Generation Technique | CORI Hierarchical | CORI Flat
FP-SVM-Documents | 0.270 | 0.170
FP-SVM-Snippets | 0.200 | 0.183
Random Sampling | 0.177
QPilot (backlinks + front page) | 0.050
VLDB 2002 (extended version)
64F-measure vs. Hierarchy Depth
ACM TOIS 2003
65Real Confusion Matrix for Top Node of Hierarchy
Health Sports Science Computers Arts
Health 0.753 0.018 0.124 0.021 0.017
Sports 0.006 0.171 0.021 0.016 0.064
Science 0.016 0.024 0.255 0.047 0.018
Computers 0.004 0.042 0.080 0.610 0.031
Arts 0.004 0.024 0.027 0.031 0.298
66Overlap Elimination
67No Support for Conjunctive Queries (Boolean vs. Vector-space)