Title: Probe, Count, and Classify: Categorizing Hidden-Web Databases
Probe, Count, and Classify: Categorizing Hidden-Web Databases
- SIGMOD 2001
- Panagiotis G. Ipeirotis
- Computer Science Dept. Columbia University
- DB Lab.
- Hee-Jeon Lee
1. INTRODUCTION (1)
- Ordinary web
- Traditional web, crawlable
- 2 billion pages
- Hidden Web
- Only accessible through search interfaces
- Cohesive, Higher quality, not crawlable
- 500 billion pages
1. INTRODUCTION (2)
- Example: query with the keyword "cancer"
- PubMed (http://ncbi.nlm.nih.gov/PubMed/)
- Search engine: AltaVista
- The problem of accurate information retrieval on the WWW
- Retrieve static documents
- Determine searchable databases
- Searchable web databases
- A collection of text documents that is searchable through a web-accessible search interface
- Focus is on text
1. INTRODUCTION (3)
- Manually classifying searchable web databases (Yahoo!-like hierarchical categorization scheme)
- Lack of scalability
- Automating the classification process
- Combination of 1) machine learning and 2) database querying techniques
- Transform the rules of the classifier into a set of query probes
2. TEXT-DATABASE CLASSIFICATION
- Organize the space of searchable databases in a hierarchical categorization scheme.
- Sec 2.1
- Define appropriate classification schemes
- Sec 2.2
- Alternative methods
2.1 Classification Schemes for Databases
2.2 Alternative Classification Strategies (1)
- To assign a searchable web database to a category
- Manually inspect the contents of the database and make a decision based on the results of the inspection
- A less manual approach
- Coverage-based classification
- Specificity-based classification
- Specificity-based
- CBS SportsLine
- Coverage-based
- CBS SportsLine - Basketball
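The two strategies can be made concrete with a small sketch. It assumes the definitions used in the underlying paper: the coverage of a database for a category is the number of its documents in that category, and the specificity is the fraction of its documents in that category. The function names, database names, and document counts below are hypothetical and chosen only for illustration.

```python
def coverage(category_counts, category):
    """Absolute number of documents the database holds in `category`."""
    return category_counts.get(category, 0)


def specificity(category_counts, category):
    """Fraction of the database's documents that fall in `category`."""
    total = sum(category_counts.values())
    return category_counts.get(category, 0) / total if total else 0.0


# Made-up per-category document counts for two hypothetical databases.
focused_site = {"Sports": 900, "Health": 50, "Computers": 50}
large_portal = {"Sports": 5000, "Health": 20000, "Computers": 25000}

print(coverage(focused_site, "Sports"), specificity(focused_site, "Sports"))
print(coverage(large_portal, "Sports"), specificity(large_portal, "Sports"))
```

In this toy data the focused site is far more specific to Sports, while the large portal has the greater Sports coverage, which is why the two criteria can lead to different classifications.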
2.2 Alternative Classification Strategies (2)
2.2 Alternative Classification Strategies (3)
3. CLASSIFYING DATABASES THROUGH PROBING
- How can we approximate coverage and specificity information for a given database without accessing its contents?
- Sec 3.1
- Train a rule-based document classifier with a set of preclassified documents.
- Sec 3.2
- Transform classifier rules into queries.
- Sec 3.3
- Adaptively issue queries to databases, extracting and adjusting the number of matches for each query using the classifier's confusion matrix.
- Sec 3.4
- Classify databases using the adjusted number of query matches.
3.1 Training a Document Classifier (1)
- Rely on a rule-based document classifier to create the probing queries
- Use supervised learning to construct a rule-based classifier from a set of preclassified documents
- The resulting classifier is a set of logical rules
- Antecedents are conjunctions of words.
- Consequents are the category assignments for each document.
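A rule set of this shape can be sketched in a few lines. The sketch below assumes an ordered list of rules in which the first matching rule decides the category; the ordering, the function name, and the Health and Computers rules are illustrative assumptions, not rules learned from real training data.

```python
# Hypothetical ordered rule list: (antecedent words, consequent category).
RULES = [
    ({"jordan", "bulls"}, "Sports"),
    ({"cancer", "lung"}, "Health"),
    ({"intel"}, "Computers"),
]


def classify_document(text, rules, default=None):
    """Return the category of the first rule whose words all occur in `text`."""
    words = set(text.lower().split())
    for antecedent, category in rules:
        # The antecedent is a conjunction: every word must be present.
        if antecedent <= words:
            return category
    return default


print(classify_document("Jordan and the Bulls win again", RULES))
```

A document that satisfies no antecedent gets the `default` value, which here stands for "no category assigned".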
3.1 Training a Document Classifier (2)
- To define a document classifier over an entire
hierarchical classification scheme, train one
flat rule-based document classifier for each
internal node of the hierarchy.
3.1 Training a Document Classifier (3)
- To produce a rule-based document classifier
- 1. Feature selection
- Eliminate from the training set all words that appear very frequently in the training documents, as well as very infrequently appearing words.
- Eliminate the terms that have the least impact on the class distribution of documents
- 2. Classify the database according to the number of documents that it contains in each category
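The first part of feature selection, dropping very frequent and very rare words, might look like the following sketch. The thresholds `min_df` and `max_df_ratio` and the function name are assumptions made for illustration; the slides do not give concrete cutoffs, and the second filter (terms with the least impact on the class distribution) is not shown.

```python
from collections import Counter


def select_features(documents, min_df=2, max_df_ratio=0.6):
    """Keep words that are neither too rare nor too common.

    min_df: minimum number of documents a word must appear in (assumed).
    max_df_ratio: maximum fraction of documents a word may appear in (assumed).
    """
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    n_docs = len(documents)
    return {
        word
        for word, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df_ratio
    }


training_docs = [
    "the bulls won the game",
    "the lung cancer trial",
    "the bulls lost",
    "the cancer study",
    "the intel chip",
]
print(sorted(select_features(training_docs)))  # ['bulls', 'cancer']
```

Here "the" is dropped for appearing in every document, and words such as "intel" are dropped for appearing only once.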
3.2 Defining Query Probes from a Document Classifier (1)
- Query probe
- Will help estimate the number of documents for
each category of interest in a searchable web
database
3.2 Defining Query Probes from a Document Classifier (2)
- Map each rule into a Boolean query
- IF jordan AND bulls THEN Sports -> jordan AND bulls
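The rule-to-query mapping and the match counting can be sketched as follows. A real hidden-web database would only report a number of matches on its results page; here a hypothetical `count_matches` function over an in-memory list of documents stands in for that search interface, and all rules except the jordan/bulls one are illustrative.

```python
from collections import Counter

# (antecedent words, category); only the first rule comes from the slides.
RULES = [
    (("jordan", "bulls"), "Sports"),
    (("cancer", "lung"), "Health"),
    (("diabetes",), "Health"),
]


def rule_to_query(words):
    """Turn a rule antecedent into a Boolean conjunctive query string."""
    return " AND ".join(words)


def count_matches(documents, words):
    """Stand-in for a search interface: documents containing all query words."""
    return sum(
        1 for doc in documents if set(words) <= set(doc.lower().split())
    )


def probe(documents, rules):
    """Sum the match counts of all probes that belong to the same category."""
    counts = Counter()
    for words, category in rules:
        counts[category] += count_matches(documents, words)
    return counts


docs = [
    "Jordan led the Bulls to a title",
    "lung cancer study",
    "bulls market rally",
    "diabetes and cancer of the lung",
]
print(rule_to_query(RULES[0][0]))  # jordan AND bulls
print(dict(probe(docs, RULES)))    # {'Sports': 1, 'Health': 3}
```

Note that the last document matches two Health probes and is therefore counted twice; raw probe counts are estimates, not exact category sizes.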
3.2 Defining Query Probes from a Document Classifier (3)
Boolean query
3.3 Adjusting Probing Results (1)
- Confusion matrix
- Need to adjust initial probing results to account for potential errors
- Commonly used in the machine learning community to report document classification results
3.3 Adjusting Probing Results (2)
3.3 Adjusting Probing Results (3)
- Diagonally dominant matrix
- Gershgorin disk theorem
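The adjustment can be illustrated with a short sketch. It assumes, following the underlying paper, that the probe-based counts relate to the true counts by ECoverage ≈ M × Coverage, where entry M[i][j] of the normalized confusion matrix is the fraction of category-j documents that the probes count under category i. A strictly diagonally dominant M is invertible (a consequence of the Gershgorin disk theorem), so the true counts can be recovered by solving the linear system. The matrix values and function names are made up for illustration.

```python
def is_diagonally_dominant(matrix):
    """True if every diagonal entry exceeds the sum of the rest of its row."""
    return all(
        abs(row[i]) > sum(abs(v) for j, v in enumerate(row) if j != i)
        for i, row in enumerate(matrix)
    )


def adjust_counts(confusion, estimated):
    """Solve confusion @ x = estimated by Gauss-Jordan elimination."""
    n = len(estimated)
    aug = [list(map(float, row)) + [float(b)]
           for row, b in zip(confusion, estimated)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("confusion matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                for k in range(col, n + 1):
                    aug[row][k] -= factor * aug[col][k]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Hypothetical normalized confusion matrix (each column sums to 1).
M = [
    [0.8, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.1, 0.0, 0.9],
]
ecoverage = [1000, 1950, 550]  # made-up raw probe counts per category

assert is_diagonally_dominant(M)
print([round(x) for x in adjust_counts(M, ecoverage)])  # [1000, 2000, 500]
```

In this toy case the second category was undercounted by the probes (1950 instead of 2000) and the third was overcounted, and the adjustment corrects both.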
3.4 Using Probing Results for Classification (1)
- Classify each database in a top-to-bottom way
- 1. Each database is first classified by the root-level classifier
- 2. It is then recursively pushed down to the lower-level classifiers
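The top-to-bottom procedure can be sketched as a recursive function. The sketch assumes that a database is pushed into a subcategory only when both its estimated specificity and its estimated coverage for that subcategory reach user-chosen thresholds (called `tau_s` and `tau_c` here); the hierarchy, the counts, and the `get_count` callback are hypothetical stand-ins for the adjusted probe results.

```python
def classify_database(node, hierarchy, get_count, tau_s, tau_c):
    """Return the deepest categories under `node` that pass both thresholds."""
    children = hierarchy.get(node, [])
    if not children:
        return [node]  # leaf category: cannot push further down
    counts = {child: get_count(child) for child in children}
    total = sum(counts.values())
    result = []
    for child in children:
        spec = counts[child] / total if total else 0.0
        if spec >= tau_s and counts[child] >= tau_c:
            result.extend(
                classify_database(child, hierarchy, get_count, tau_s, tau_c)
            )
    # If no subcategory qualifies, the database stays at the current node.
    return result or [node]


# Hypothetical hierarchy and adjusted match counts for one database.
hierarchy = {"Root": ["Sports", "Health"], "Sports": ["Basketball", "Soccer"]}
counts = {"Sports": 900, "Health": 100, "Basketball": 700, "Soccer": 200}

print(classify_database("Root", hierarchy, counts.get, tau_s=0.5, tau_c=100))
print(classify_database("Root", hierarchy, counts.get, tau_s=0.8, tau_c=100))
```

With the lower specificity threshold the database is pushed down to Basketball; with the stricter one it stops at Sports, because no subcategory of Sports is specific enough.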
3.4 Using Probing Results for Classification (2)
3.4 Using Probing Results for Classification (3)
4. EXPERIMENTAL SETTING
- Sec 4.1 Data Collections
- Controlled databases
- Homogeneous
- Heterogeneous
- Real web databases
- Sec 4.2 Techniques for Comparison
- Probe and Count (PnC)
- Document Sampling (DS)
- Query probing to automatically construct a language model of a text database
- Title-based Querying (TQ)
- One long query for each category using the title of the category itself augmented by the titles of all its subcategories
4.3 Evaluation Metrics (1)
4.3 Evaluation Metrics (2)
4.3 Evaluation Metrics (3)
- Correct: Expanded(Programming) = {Prog.., C.., Pe.., Java, Visu..}
- Classified: Expanded(Java) = {Java}
- Precision = |{Java}| / |{Java}| = 1 / 1 = 1
- Recall = |{Java}| / |{Prog.., C.., Pe.., Java, Visu..}| = 1 / 5
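The example above can be reproduced with a short sketch. It assumes that each category set is expanded with all of its subcategories, that precision is the overlap divided by the size of the expanded classified set, and that recall is the overlap divided by the size of the expanded correct set. The subcategory names other than Java are hypothetical stand-ins for the abbreviated labels on the slide.

```python
def expand(categories, hierarchy):
    """Return the given categories together with all their descendants."""
    expanded = set()
    stack = list(categories)
    while stack:
        category = stack.pop()
        if category not in expanded:
            expanded.add(category)
            stack.extend(hierarchy.get(category, []))
    return expanded


def precision_recall(correct, classified, hierarchy):
    """Precision and recall computed over the expanded category sets."""
    correct_exp = expand(correct, hierarchy)
    classified_exp = expand(classified, hierarchy)
    overlap = correct_exp & classified_exp
    precision = len(overlap) / len(classified_exp) if classified_exp else 0.0
    recall = len(overlap) / len(correct_exp) if correct_exp else 0.0
    return precision, recall


# Hypothetical subtree: names other than "Java" are illustrative guesses.
hierarchy = {"Programming": ["C/C++", "Perl", "Java", "Visual Basic"]}

print(precision_recall(["Programming"], ["Java"], hierarchy))  # (1.0, 0.2)
```

Expanding the sets means that classifying a Programming database under Java is scored as precise but incomplete, rather than simply wrong.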
5. EXPERIMENTAL RESULTS
- Sec 5.1 Tuning the PnC Technique
- Effect of Confusion Matrix Adjustment (CMA)
- Effect of Feature Selection
- ECoverage estimates with feature selection (FS) on were between 15% and 20% better
- ESpecificity estimates with FS on were around 10% better
5.2 Results over the Controlled Databases
5.3 Results over the Web Databases
6. CONCLUSIONS AND FUTURE WORK
- Presented a novel and efficient method for the hierarchical classification of text databases on the web.
- A step toward completely automating the classification process is to eliminate the need for a human to construct the simple wrapper for each database to classify.
- This need can be eliminated by automatically learning how to parse the pages with query results.