Title: Probe, Count, and Classify: Categorizing Hidden Web Databases
1. Probe, Count, and Classify: Categorizing Hidden Web Databases
- Panagiotis G. Ipeirotis
- Luis Gravano
- Columbia University
- Mehran Sahami
- E.piphany Inc.
DIMACS Summer School Tutorial on New Frontiers in
Data Mining (Web theme) -- Thursday, August 16, 2001
2. Surface Web vs. Hidden Web
- Surface Web
  - Link structure
  - Crawlable
- Hidden Web
  - No link structure
  - Documents hidden behind search forms
3. Do We Need the Hidden Web?
- Example: PubMed/MEDLINE
- PubMed (www.ncbi.nlm.nih.gov/PubMed) search for "cancer" → 1,341,586 matches
- AltaVista search for "cancer site:www.ncbi.nlm.nih.gov" → 21,830 matches
4. Interacting With Searchable Text Databases
- Searching: metasearchers
- Browsing: Yahoo!-like web directories
  - InvisibleWeb.com
  - SearchEngineGuide.com
- Example from InvisibleWeb.com:
  - Health > Publications > PubMED
- Created manually!
5. Classifying Text Databases Automatically: Outline
- Definition of classification
- Classification through query probing
- Experiments
6. Database Classification: Two Definitions
- Coverage-based classification
  - Database contains many documents about a category
  - Coverage = number of documents in the database about this category
- Specificity-based classification
  - Database contains mainly documents about a category
  - Specificity = fraction of the database's documents that are about this category (both measures are formalized below)
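A hedged formalization of the two measures; the set notation is assumed here (not from the slides), with |D| the number of documents in database D and |D ∩ C| the number of those documents about category C:

```latex
\mathrm{Coverage}(D, C) = |D \cap C|,
\qquad
\mathrm{Specificity}(D, C) = \frac{|D \cap C|}{|D|}
```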
7. Database Classification: An Example
- Category: Basketball
- Coverage-based classification
- ESPN.com, NBA.com, not KnicksTerritory.com
- Specificity-based classification
- NBA.com, KnicksTerritory.com, not ESPN.com
8. Database Classification: More Details
- Thresholds for coverage and specificity
  - Tc: coverage threshold (e.g., 100)
  - Ts: specificity threshold (e.g., 0.5)
  - Tc, Ts are editorial choices
- Ideal(D): the set of classes for database D
- Class C is in Ideal(D) if (formalized below):
  - D has enough coverage and specificity (Tc, Ts) for C and all of C's ancestors, and
  - D fails to have both enough coverage and specificity for each child of C
[Figure: sample hierarchy under Root -- SPORTS (Coverage 800, Specificity 0.8) with children BASEBALL (Specificity 0.5) and BASKETBALL (Specificity 0.5); HEALTH (Coverage 200, Specificity 0.2)]
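A hedged restatement of the two membership conditions in symbols; Ancestors(C) and Children(C) are assumed notation for the category hierarchy, and the comparisons are taken as non-strict:

```latex
C \in \mathrm{Ideal}(D) \iff
  \Big( \forall A \in \mathrm{Ancestors}(C) \cup \{C\}:\;
        \mathrm{Coverage}(D,A) \ge T_c \wedge \mathrm{Specificity}(D,A) \ge T_s \Big)
  \wedge
  \Big( \forall C' \in \mathrm{Children}(C):\;
        \neg\big(\mathrm{Coverage}(D,C') \ge T_c \wedge \mathrm{Specificity}(D,C') \ge T_s\big) \Big)
```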
9. From Document to Database Classification
- If we knew the categories of all the documents inside the database, we would be done!
- But we do not have direct access to the documents.
- Databases do not export such data!
- How can we extract this information?
10. Our Approach: Query Probing
- Train a rule-based document classifier.
- Transform classifier rules into queries.
- Adaptively send queries to databases.
- Categorize the databases based on adjusted
number of query matches.
11. Training a Rule-based Document Classifier
- Feature selection: Zipf's law pruning, followed by information-theoretic feature selection [Koller & Sahami, 1996]
- Classifier learning: AT&T's RIPPER [Cohen, 1995]
- Input: a set of pre-classified, labeled documents
- Output: a set of classification rules
  - IF linux THEN Computers
  - IF jordan AND bulls THEN Sports
  - IF lung AND cancer THEN Health
12. Constructing Query Probes
- Transform each rule into a query (see the sketch below)
  - IF lung AND cancer THEN Health → "lung AND cancer"
  - IF linux THEN Computers → "linux"
- Send the queries to the database
- Get the number of matches for each query, NOT the documents (i.e., the number of documents that match each rule)
- These documents would have been classified by the rule under its associated category!
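A minimal sketch of the rule-to-query transformation; the tuple representation of the rules is assumed here for illustration, not taken from the talk:

```python
# Hypothetical representation of RIPPER-style rules: (antecedent terms, category).
rules = [
    (["lung", "cancer"], "Health"),
    (["linux"], "Computers"),
    (["jordan", "bulls"], "Sports"),
]

def rule_to_probe(terms):
    """A conjunctive rule body becomes a boolean AND query."""
    return " AND ".join(terms)

# Map each query probe to the category its rule predicts.
probes = {rule_to_probe(terms): category for terms, category in rules}
# -> {'lung AND cancer': 'Health', 'linux': 'Computers', 'jordan AND bulls': 'Sports'}
```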
13. Adjusting Query Results
- Classifiers are not perfect!
- Queries do not retrieve all the documents in a category
- Queries for one category match documents not in this category
- From the classifier's training phase we know its confusion matrix
14. Confusion Matrix
[Figure: confusion matrix M; rows = "classified into", columns = "correct class"]
- Example: 10% of Sports classified as Computers → 10% of the 5000 Sports docs go to Computers
- M · Coverage(D) = ECoverage(D)
15. Confusion Matrix Adjustment: Compensating for Classifier's Errors
- M is diagonally dominant, hence invertible
- Coverage(D) ~ M^-1 · ECoverage(D)
- Multiplying by M^-1 better approximates the correct result (see the sketch below)
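A minimal numpy sketch of the adjustment; the 3x3 matrix and the probe counts are made-up numbers for illustration:

```python
import numpy as np

# Hypothetical confusion matrix M estimated during training:
# M[i, j] = fraction of documents of correct class j classified into class i.
M = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])

# Hypothetical coverage estimates derived from query-probe match counts.
ecoverage = np.array([500.0, 800.0, 200.0])

# M is diagonally dominant, hence invertible; solving M x = ECoverage(D)
# yields x = M^-1 . ECoverage(D) without explicitly forming the inverse.
coverage = np.linalg.solve(M, ecoverage)
```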
16. Classifying a Database
- Send the query probes for the top-level categories
- Get the number of matches for each probe
- Calculate Specificity and Coverage for each category
- Push the database to the qualifying categories (those with Specificity > Ts and Coverage > Tc)
- Repeat for each of the qualifying categories
- Return the classes that satisfy the coverage/specificity conditions
- The result is the approximation of the Ideal classification (sketched below)
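A minimal sketch of this top-down loop. The Category class, the caller-supplied adjusted_matches function, and the estimate of specificity as adjusted matches over database size are all assumptions for illustration; the paper's exact estimators may differ:

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    name: str
    children: list = field(default_factory=list)

def classify(db, node, Tc, Ts, db_size, adjusted_matches):
    """Approximate Ideal(D): push db down the hierarchy via probing.

    adjusted_matches(db, category) is a caller-supplied function that
    probes the database and returns the confusion-matrix-adjusted
    number of matches (slides 12-15).
    """
    qualifying = []
    for child in node.children:
        cov = adjusted_matches(db, child)   # Coverage estimate
        spec = cov / db_size                # Specificity estimate (assumed form)
        if cov > Tc and spec > Ts:
            qualifying.append(child)
    if not qualifying:
        return {node.name}                  # db stays at the current node
    result = set()
    for child in qualifying:                # probe only qualifying subtrees
        result |= classify(db, child, Tc, Ts, db_size, adjusted_matches)
    return result
```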
17. Real Example: ACM Digital Library (Tc=100, Ts=0.5)
18. Experiments: Data
- 72-node, 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes)
- 500,000 Usenet articles (April-May 2000)
- Newsgroups assigned by hand to hierarchy nodes
- RIPPER trained with 54,000 articles (1,000 articles per leaf)
- 27,000 articles used to construct estimates of the confusion matrices
- Remaining 419,000 articles used to build 500 Controlled Databases of varying category mixes and sizes
19. Comparison With Alternatives
- DS: random sampling of documents via query probes
  - [Callan et al., SIGMOD 1999]
  - Different task: gathering vocabulary statistics
  - We adapted it for database classification
- TQ: title-based probing
  - [Yu et al., WISE 2000]
  - Query probes are simply the category names
20. Experiments: Metrics
[Figure: a node N and Expanded(N), the subtree rooted at N]
- Accuracy of classification results (see the sketch below)
  - Expanded(N) = N and all its descendants
  - Correct = Expanded(Ideal(D))
  - Classified = Expanded(Approximate(D))
  - Precision = |Correct ∩ Classified| / |Classified|
  - Recall = |Correct ∩ Classified| / |Correct|
  - F-measure = 2 · Precision · Recall / (Precision + Recall)
- Cost of classification: number of queries sent to the database
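A minimal sketch of these metrics, assuming node objects with a children attribute (e.g., the hypothetical Category class from the classification sketch above):

```python
def expanded(node):
    """Expanded(N): the node N plus all of its descendants."""
    nodes = {node}
    for child in node.children:
        nodes |= expanded(child)
    return nodes

def f_measure(ideal_nodes, approx_nodes):
    """Expanded-set precision, recall, and F-measure for one database."""
    correct = set().union(*(expanded(n) for n in ideal_nodes))
    classified = set().union(*(expanded(n) for n in approx_nodes))
    overlap = len(correct & classified)
    precision = overlap / len(classified)
    recall = overlap / len(correct)
    return 2 * precision * recall / (precision + recall)
```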
21. Average F-measure, Controlled Databases
[Figure legend: PnC = Probe & Count, DS = Document Sampling, TQ = Title-based Probing]
22. Experimental Results: Controlled Databases
- Feature selection helps.
- Confusion-matrix adjustment helps.
- F-measure above 0.8 for most <Tc, Ts> combinations.
- Results degrade gracefully with hierarchy depth.
- Relatively small number of probes needed for most <Tc, Ts> combinations tried.
- Probes are short: 1.5 words on average, 4 words maximum.
- Both better performance and lower cost than DS (the Callan et al. adaptation) and TQ (Yu et al.).
23. Web Databases
- 130 real databases classified from InvisibleWeb.
- Used InvisibleWeb's categorization as correct.
- Simple wrappers for querying (only the number of matches is needed).
- The Ts, Tc thresholds are not known (unlike with the Controlled databases) but are implicit in the InvisibleWeb categorization.
- We can learn/validate the thresholds (tricky, but easy!).
- More details in the paper!
24. Web Databases: Learning Thresholds
25. Experimental Results: Web Databases
- 130 real Web databases.
- F-measure above 0.7 for the best <Tc, Ts> combination learned.
- 185 query probes per database needed on average for classification.
- Probes are short: 1.5 words on average, 4 words maximum.
26. Conclusions
- Accurate classification using only a small number of short queries
- No need for document retrieval
- Only need a result like "X matches found"
- No need for any cooperation or special metadata from databases
- URL: http://qprober.cs.columbia.edu
27. Current and Future Work
- Build wrappers automatically
- Extend to non-topical categories
- Evaluate impact of varying search interfaces (e.g., Boolean vs. ranked)
- Extend to other classifiers (e.g., SVMs or Bayesian models)
- Integrate with searching (connection with database selection?)
28. Questions?
29. Contributions
- Easy, inexpensive method for database classification
- Uses results from document classification
- Indirect classification of the documents in a database
- Does not inspect documents, only the number of matches
- Adjusts results according to the classifier's performance
- Easy wrapper construction
- No need for any metadata from the database
30. Related Work
- Callan et al., SIGMOD 1999
- Gauch et al., ProFusion
- Dolin et al., Pharos
- Yu et al., WISE 2000
- Raghavan and Garcia-Molina, VLDB 2001
31. Controlled Databases
- 500 databases built using 419,000 newsgroup articles
- One label per document
- 350 databases with a single (not necessarily leaf) category
- 150 databases with varying category mixes
- Database sizes range from 25 to 25,000 articles
- Indexed and queried using SMART
32. F-measure for Different Hierarchy Depths
[Figure legend: PnC = Probe & Count, DS = Document Sampling, TQ = Title-based Probing; Tc=8, Ts=0.3]
33. Query Probes Per Controlled Database
34. Web Databases: Number of Query Probes
35. 3-fold Cross-validation
36. Real Confusion Matrix for Top Node of Hierarchy