Title: Probe, Count, and Classify: Categorizing Hidden Web Databases
1. Probe, Count, and Classify: Categorizing Hidden Web Databases
- Panagiotis G. Ipeirotis
- Luis Gravano
- Columbia University
- Mehran Sahami
- E.piphany Inc.
DIMACS Summer School Tutorial on New Frontiers in
Data Mining (Web theme) -- Thursday, August 16, 2001
2. Surface Web vs. Hidden Web
- Surface Web
  - Link structure
  - Crawlable
- Hidden Web
  - No link structure
  - Documents hidden behind search forms
3. Do We Need the Hidden Web?
- Example: PubMed/MEDLINE
- PubMed (www.ncbi.nlm.nih.gov/PubMed) search for "cancer" → 1,341,586 matches
- AltaVista search for "cancer site:www.ncbi.nlm.nih.gov" → 21,830 matches
4. Interacting With Searchable Text Databases
- Searching: metasearchers
- Browsing: Yahoo!-like web directories
  - InvisibleWeb.com
  - SearchEngineGuide.com
- Example from InvisibleWeb.com:
  - Health > Publications > PubMED
- Created manually!
5. Classifying Text Databases Automatically: Outline
- Definition of classification
- Classification through query probing
- Experiments
6. Database Classification: Two Definitions
- Coverage-based classification
  - Database contains many documents about a category
  - Coverage = number of documents in the database about this category
- Specificity-based classification
  - Database contains mainly documents about a category
  - Specificity = fraction of the database's documents that are about this category (both measures are formalized below)
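A hedged formalization of the two measures; the set notation is assumed here (not from the slides), with |D| the number of documents in database D and |D ∩ C| the number of those documents about category C:

```latex
\mathrm{Coverage}(D, C) = |D \cap C|,
\qquad
\mathrm{Specificity}(D, C) = \frac{|D \cap C|}{|D|}
```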
7. Database Classification: An Example
- Category: Basketball
- Coverage-based classification
- ESPN.com, NBA.com, not KnicksTerritory.com
- Specificity-based classification
- NBA.com, KnicksTerritory.com, not ESPN.com
8. Database Classification: More Details
- Thresholds for coverage and specificity
  - Tc: coverage threshold (e.g., 100)
  - Ts: specificity threshold (e.g., 0.5)
  - Tc, Ts are editorial choices
- Ideal(D): the set of classes for database D
- Class C is in Ideal(D) if (formalized below):
  - D has enough coverage and specificity (Tc, Ts) for C and all of C's ancestors, and
  - D fails to have both enough coverage and specificity for each child of C
[Figure: sample hierarchy under Root -- SPORTS (Coverage 800, Specificity 0.8) with children BASEBALL (Specificity 0.5) and BASKETBALL (Specificity 0.5); HEALTH (Coverage 200, Specificity 0.2)]
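A hedged restatement of the two membership conditions in symbols; Ancestors(C) and Children(C) are assumed notation for the category hierarchy, and the comparisons are taken as non-strict:

```latex
C \in \mathrm{Ideal}(D) \iff
  \Big( \forall A \in \mathrm{Ancestors}(C) \cup \{C\}:\;
        \mathrm{Coverage}(D,A) \ge T_c \wedge \mathrm{Specificity}(D,A) \ge T_s \Big)
  \wedge
  \Big( \forall C' \in \mathrm{Children}(C):\;
        \neg\big(\mathrm{Coverage}(D,C') \ge T_c \wedge \mathrm{Specificity}(D,C') \ge T_s\big) \Big)
```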
9. From Document to Database Classification
- If we knew the categories of all the documents inside the database, we would be done!
- But we do not have direct access to the documents.
- Databases do not export such data!
- How can we extract this information?
10. Our Approach: Query Probing
- Train a rule-based document classifier.
- Transform classifier rules into queries.
- Adaptively send queries to databases.
- Categorize the databases based on adjusted
number of query matches.
11. Training a Rule-based Document Classifier
- Feature selection: Zipf's law pruning, followed by information-theoretic feature selection [Koller & Sahami, 1996]
- Classifier learning: AT&T's RIPPER [Cohen, 1995]
- Input: a set of pre-classified, labeled documents
- Output: a set of classification rules
  - IF linux THEN Computers
  - IF jordan AND bulls THEN Sports
  - IF lung AND cancer THEN Health
12. Constructing Query Probes
- Transform each rule into a query (see the sketch below)
  - IF lung AND cancer THEN Health → "lung AND cancer"
  - IF linux THEN Computers → "linux"
- Send the queries to the database
- Get the number of matches for each query, NOT the documents (i.e., the number of documents that match each rule)
- These documents would have been classified by the rule under its associated category!
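A minimal sketch of the rule-to-query transformation; the tuple representation of the rules is assumed here for illustration, not taken from the talk:

```python
# Hypothetical representation of RIPPER-style rules: (antecedent terms, category).
rules = [
    (["lung", "cancer"], "Health"),
    (["linux"], "Computers"),
    (["jordan", "bulls"], "Sports"),
]

def rule_to_probe(terms):
    """A conjunctive rule body becomes a boolean AND query."""
    return " AND ".join(terms)

# Map each query probe to the category its rule predicts.
probes = {rule_to_probe(terms): category for terms, category in rules}
# -> {'lung AND cancer': 'Health', 'linux': 'Computers', 'jordan AND bulls': 'Sports'}
```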
13. Adjusting Query Results
- Classifiers are not perfect!
- Queries do not retrieve all the documents in a category
- Queries for one category match documents not in this category
- From the classifier's training phase we know its confusion matrix
14. Confusion Matrix
[Figure: confusion matrix M; rows = "classified into", columns = "correct class"]
- Example: 10% of Sports classified as Computers → 10% of the 5000 Sports docs go to Computers
- M · Coverage(D) = ECoverage(D)
15. Confusion Matrix Adjustment: Compensating for Classifier's Errors
- M is diagonally dominant, hence invertible
- Coverage(D) ~ M^-1 · ECoverage(D)
- Multiplying by M^-1 better approximates the correct result (see the sketch below)
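A minimal numpy sketch of the adjustment; the 3x3 matrix and the probe counts are made-up numbers for illustration:

```python
import numpy as np

# Hypothetical confusion matrix M estimated during training:
# M[i, j] = fraction of documents of correct class j classified into class i.
M = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])

# Hypothetical coverage estimates derived from query-probe match counts.
ecoverage = np.array([500.0, 800.0, 200.0])

# M is diagonally dominant, hence invertible; solving M x = ECoverage(D)
# yields x = M^-1 . ECoverage(D) without explicitly forming the inverse.
coverage = np.linalg.solve(M, ecoverage)
```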
16. Classifying a Database
- Send the query probes for the top-level categories
- Get the number of matches for each probe
- Calculate Specificity and Coverage for each category
- Push the database to the qualifying categories (those with Specificity > Ts and Coverage > Tc)
- Repeat for each of the qualifying categories
- Return the classes that satisfy the coverage/specificity conditions
- The result is the approximation of the Ideal classification (sketched below)
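A minimal sketch of this top-down loop. The Category class, the caller-supplied adjusted_matches function, and the estimate of specificity as adjusted matches over database size are all assumptions for illustration; the paper's exact estimators may differ:

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    name: str
    children: list = field(default_factory=list)

def classify(db, node, Tc, Ts, db_size, adjusted_matches):
    """Approximate Ideal(D): push db down the hierarchy via probing.

    adjusted_matches(db, category) is a caller-supplied function that
    probes the database and returns the confusion-matrix-adjusted
    number of matches (slides 12-15).
    """
    qualifying = []
    for child in node.children:
        cov = adjusted_matches(db, child)   # Coverage estimate
        spec = cov / db_size                # Specificity estimate (assumed form)
        if cov > Tc and spec > Ts:
            qualifying.append(child)
    if not qualifying:
        return {node.name}                  # db stays at the current node
    result = set()
    for child in qualifying:                # probe only qualifying subtrees
        result |= classify(db, child, Tc, Ts, db_size, adjusted_matches)
    return result
```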
17. Real Example: ACM Digital Library (Tc=100, Ts=0.5)
18. Experiments: Data
- 72-node, 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes)
- 500,000 Usenet articles (April-May 2000)
- Newsgroups assigned by hand to hierarchy nodes
- RIPPER trained with 54,000 articles (1,000 articles per leaf)
- 27,000 articles used to construct estimates of the confusion matrices
- Remaining 419,000 articles used to build 500 Controlled Databases of varying category mixes and sizes
19. Comparison With Alternatives
- DS: random sampling of documents via query probes
  - [Callan et al., SIGMOD 1999]
  - Different task: gathering vocabulary statistics
  - We adapted it for database classification
- TQ: title-based probing
  - [Yu et al., WISE 2000]
  - Query probes are simply the category names
20. Experiments: Metrics
[Figure: a node N and Expanded(N), the subtree rooted at N]
- Accuracy of classification results (see the sketch below)
  - Expanded(N) = N and all its descendants
  - Correct = Expanded(Ideal(D))
  - Classified = Expanded(Approximate(D))
  - Precision = |Correct ∩ Classified| / |Classified|
  - Recall = |Correct ∩ Classified| / |Correct|
  - F-measure = 2 · Precision · Recall / (Precision + Recall)
- Cost of classification: number of queries sent to the database
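A minimal sketch of these metrics, assuming node objects with a children attribute (e.g., the hypothetical Category class from the classification sketch above):

```python
def expanded(node):
    """Expanded(N): the node N plus all of its descendants."""
    nodes = {node}
    for child in node.children:
        nodes |= expanded(child)
    return nodes

def f_measure(ideal_nodes, approx_nodes):
    """Expanded-set precision, recall, and F-measure for one database."""
    correct = set().union(*(expanded(n) for n in ideal_nodes))
    classified = set().union(*(expanded(n) for n in approx_nodes))
    overlap = len(correct & classified)
    precision = overlap / len(classified)
    recall = overlap / len(correct)
    return 2 * precision * recall / (precision + recall)
```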
21. Average F-measure, Controlled Databases
[Figure legend: PnC = Probe & Count, DS = Document Sampling, TQ = Title-based Probing]
22. Experimental Results: Controlled Databases
- Feature selection helps.
- Confusion-matrix adjustment helps.
- F-measure above 0.8 for most <Tc, Ts> combinations.
- Results degrade gracefully with hierarchy depth.
- Relatively small number of probes needed for most <Tc, Ts> combinations tried.
- Probes are short: 1.5 words on average, 4 words maximum.
- Both better performance and lower cost than DS (the Callan et al. adaptation) and TQ (Yu et al.).
23. Web Databases
- 130 real databases classified from InvisibleWeb.
- Used InvisibleWeb's categorization as correct.
- Simple wrappers for querying (only the number of matches is needed).
- The Ts, Tc thresholds are not known (unlike with the Controlled databases) but are implicit in the InvisibleWeb categorization.
- We can learn/validate the thresholds (tricky, but easy!).
- More details in the paper!
24. Web Databases: Learning Thresholds
25. Experimental Results: Web Databases
- 130 real Web databases.
- F-measure above 0.7 for the best <Tc, Ts> combination learned.
- 185 query probes per database needed on average for classification.
- Probes are short: 1.5 words on average, 4 words maximum.
26. Conclusions
- Accurate classification using only a small number of short queries
- No need for document retrieval
- Only need a result like "X matches found"
- No need for any cooperation or special metadata from databases
- URL: http://qprober.cs.columbia.edu
27. Current and Future Work
- Build wrappers automatically
- Extend to non-topical categories
- Evaluate impact of varying search interfaces (e.g., Boolean vs. ranked)
- Extend to other classifiers (e.g., SVMs or Bayesian models)
- Integrate with searching (connection with database selection?)
28. Questions?
29. Contributions
- Easy, inexpensive method for database classification
- Uses results from document classification
- Indirect classification of the documents in a database
- Does not inspect documents, only the number of matches
- Adjusts results according to the classifier's performance
- Easy wrapper construction
- No need for any metadata from the database
30. Related Work
- Callan et al., SIGMOD 1999
- Gauch et al., ProFusion
- Dolin et al., Pharos
- Yu et al., WISE 2000
- Raghavan and Garcia-Molina, VLDB 2001
31. Controlled Databases
- 500 databases built using 419,000 newsgroup articles
- One label per document
- 350 databases with a single (not necessarily leaf) category
- 150 databases with varying category mixes
- Database sizes range from 25 to 25,000 articles
- Indexed and queried using SMART
32. F-measure for Different Hierarchy Depths
[Figure legend: PnC = Probe & Count, DS = Document Sampling, TQ = Title-based Probing; Tc=8, Ts=0.3]
33. Query Probes Per Controlled Database
34. Web Databases: Number of Query Probes
35. 3-fold Cross-validation
36. Real Confusion Matrix for Top Node of Hierarchy