Title: Automatic Classification of Text Databases Through Query Probing
1Automatic Classification of Text Databases
Through Query Probing
- Panagiotis G. Ipeirotis
- Luis Gravano
- Columbia University
- Mehran Sahami
- E.piphany Inc.
2Search-only Text Databases
- Sources of valuable information
- Hidden behind search interfaces
- Non-crawlable
- Example: Microsoft Support KB
3Interacting With Searchable Text Databases
- Searching: Metasearchers
- Browsing: Use Yahoo-like directories
- Browse + search: Category-enabled metasearchers
4Searching Text Databases: Metasearchers
- Select the good databases for a query
- Evaluate the query at these databases
- Combine the query results from the databases
- Examples: MetaCrawler, SavvySearch, Profusion
5Browsing Through Text Databases
- Yahoo-like web directories
- InvisibleWeb.com
- SearchEngineGuide.com
- TheBigHub.com
- Example from InvisibleWeb.com
- Computers > Publications > ACM DL
- Category-enabled metasearchers
- User-defined category (e.g. Recipes)
6Problem With Current Classification Approach
- Classification of databases is done manually
- This requires a lot of human effort!
7How to Classify Text Databases Automatically
Outline
- Definition of classification
- Strategies for classifying searchable databases through query probing
- Initial experiments
8Database Classification: Two Definitions
- Coverage-based classification
- The database contains many documents about the category (e.g. Basketball)
- Coverage = number of docs about this category
- Specificity-based classification
- The database contains mainly documents about this category
- Specificity = (number of docs about this category) / |DB|
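The two definitions can be worked through with a small Python sketch. The document counts below are invented purely for illustration (they are not measurements of any real site), and the function names are hypothetical.

```python
def coverage(docs_in_category: int) -> int:
    """Coverage: absolute number of documents about the category."""
    return docs_in_category


def specificity(docs_in_category: int, db_size: int) -> float:
    """Specificity: fraction of the database that is about the category."""
    if db_size <= 0:
        raise ValueError("db_size must be positive")
    return docs_in_category / db_size


# Invented numbers: a large general sports site and a small basketball-only site.
general_site = {"basketball_docs": 10_000, "all_docs": 100_000}
basketball_site = {"basketball_docs": 8_000, "all_docs": 10_000}

for name, db in [("general", general_site), ("basketball-only", basketball_site)]:
    print(
        name,
        "coverage:", coverage(db["basketball_docs"]),
        "specificity:", specificity(db["basketball_docs"], db["all_docs"]),
    )
```

Under these made-up counts both sites have high coverage, but only the basketball-only site has high specificity, which is the distinction the two definitions are meant to capture.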
9Database Classification: An Example
- Category: Basketball
- Coverage-based classification
- ESPN.com, NBA.com
- Specificity-based classification
- NBA.com, but not ESPN.com
10Categorizing a Text Database: Two Problems
- Find the category of a given document
- Find the category of all the documents inside the
database
11Categorizing Documents
- Several text classifiers available
- RIPPER (AT&T Research, William Cohen 1995)
- Input: A set of pre-classified, labeled documents
- Output: A set of classification rules
12Categorizing Documents: RIPPER
- Training set: Preclassified documents
- "Linux as a web server" → Computers
- "Linux vs. Windows" → Computers
- "Jordan was the leader of Chicago Bulls" → Sports
- "Smoking causes lung cancer" → Health
- Output: Rule-based classifier
- IF linux THEN Computers
- IF jordan AND bulls THEN Sports
- IF lung AND cancer THEN Health
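A minimal Python sketch of how rules of this form could be applied to a document. This is not RIPPER itself (which learns the rules); it only evaluates the three example rules from the slide, and the `RULES` structure and `classify` function are hypothetical names chosen for illustration.

```python
import re

# Each rule: (set of words that must all appear, category assigned).
RULES = [
    ({"linux"}, "Computers"),
    ({"jordan", "bulls"}, "Sports"),
    ({"lung", "cancer"}, "Health"),
]


def classify(text: str) -> list:
    """Return the categories of every rule whose words all occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [category for terms, category in RULES if terms <= words]


for doc in [
    "Linux as a web server",
    "Jordan was the leader of Chicago Bulls",
    "Smoking causes lung cancer",
]:
    print(doc, "->", classify(doc))
```

A document that satisfies no rule gets an empty list, which mirrors the fact that a rule set does not necessarily cover every document.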
13Precision and Recall of Document Classifier
- During the training phase
- 100 documents about computers
- Computer rules matched 50 docs
- From these 50 docs 40 were about computers
- Precision = 40/50 = 0.8
- Recall = 40/100 = 0.4
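The arithmetic on this slide can be checked with a short Python sketch. The function and parameter names are hypothetical; the numbers are the ones given above.

```python
def precision(correct_matches: int, total_matches: int) -> float:
    """Fraction of the documents matched by the rules that are truly in the category."""
    return correct_matches / total_matches


def recall(correct_matches: int, docs_in_category: int) -> float:
    """Fraction of the category's documents that the rules managed to match."""
    return correct_matches / docs_in_category


docs_about_computers = 100   # documents about computers in the training set
matched_by_rules = 50        # documents matched by the computer rules
matched_and_correct = 40     # matched documents that were about computers

print("precision:", precision(matched_and_correct, matched_by_rules))
print("recall:", recall(matched_and_correct, docs_about_computers))
```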
14From Document to Database Classification
- If we know the categories of all the documents, we are done!
- But databases do not export such data!
- How can we extract this information?
15Our Approach: Query Probing
- Design a small set of queries to probe the databases
- Categorize the database based on the probing results
16Designing and Implementing Query Probes
- The probes should extract information about the categories of the documents in the database
- Start with a document classifier (RIPPER)
- Transform each rule into a query
- IF lung AND cancer THEN health → lung cancer
- IF linux THEN computers → linux
- Get number of matches for each query
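The probing steps above can be sketched in Python as follows. Everything here is an illustrative assumption: the rule representation, the `rule_to_query` and `probe` helpers, and especially `make_toy_database`, which stands in for a real search interface that reports "X matches found". A real database would be queried over its own interface instead.

```python
from typing import Callable, Dict, List, Tuple

# A rule is (words that must all appear, category); hypothetical representation.
Rule = Tuple[Tuple[str, ...], str]

RULES: List[Rule] = [
    (("lung", "cancer"), "Health"),
    (("linux",), "Computers"),
    (("jordan", "bulls"), "Sports"),
]


def rule_to_query(rule: Rule) -> str:
    """Turn the antecedent of a rule into a conjunctive keyword query."""
    return " ".join(rule[0])


def probe(rules: List[Rule], count_matches: Callable[[str], int]) -> Dict[str, int]:
    """Send one query per rule and add up the match counts per category."""
    counts: Dict[str, int] = {}
    for rule in rules:
        category = rule[1]
        counts[category] = counts.get(category, 0) + count_matches(rule_to_query(rule))
    return counts


def make_toy_database(docs: List[str]) -> Callable[[str], int]:
    """Toy stand-in for a search interface: counts docs containing all query terms."""
    tokenized = [set(doc.lower().split()) for doc in docs]

    def count_matches(query: str) -> int:
        terms = set(query.lower().split())
        return sum(1 for words in tokenized if terms <= words)

    return count_matches


toy_db = make_toy_database([
    "linux kernel tuning",
    "lung cancer screening",
    "linux on laptops",
    "cancer of the lung",
])
print(probe(RULES, toy_db))
```

Only the number of matches is used; the documents themselves are never retrieved.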
17Three Categories and Three Databases
- Databases: ACM DL, NBA.com, PubMed
- Probes: linux → computers; jordan AND bulls → sports; lung AND cancer → health
18Using the Results for Classification
We use the results to estimate coverage and
specificity values
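One way this estimation step might look in Python is sketched below. The slides do not say how the database size is obtained, so this sketch assumes (and this is only an assumption) that specificity is approximated by normalizing each category's match count by the total matches over all categories. The counts are invented for illustration and the function name is hypothetical.

```python
from typing import Dict, Tuple


def estimate_coverage_and_specificity(
    match_counts: Dict[str, int],
) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Estimate coverage as the match count per category, and specificity as
    each category's share of all matches (ASSUMED normalization, see lead-in)."""
    total = sum(match_counts.values())
    coverage = dict(match_counts)
    specificity = {
        category: (count / total if total else 0.0)
        for category, count in match_counts.items()
    }
    return coverage, specificity


# Invented probe results for a single database.
counts = {"Computers": 900, "Sports": 50, "Health": 50}
coverage, specificity = estimate_coverage_and_specificity(counts)
print(coverage)
print(specificity)
```

With these made-up counts the database would look both high-coverage and high-specificity for Computers.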
19Adjusting Query Results
- Classifiers are not perfect!
- Queries do not retrieve all the documents that belong to a category
- Queries for one category match documents that do not belong to this category
- From the training phase of the classifier we use precision and recall
20Precision & Recall Adjustment
- Computer-category
- Rule: linux, Precision = 0.7
- Rule: cpu, Precision = 0.9
- Recall (for all the rules) = 0.4
- Probing with queries for Computers
- Query: linux → X1 matches → 0.7 X1 correct matches
- Query: cpu → X2 matches → 0.9 X2 correct matches
- From X1 + X2 documents found
- Expect 0.7 X1 + 0.9 X2 to be correct
- Expect (0.7 X1 + 0.9 X2)/0.4 total computer docs
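The adjustment can be written as a small Python function. The precision and recall values are the ones from the slide; the match counts X1 = 100 and X2 = 50 are invented for illustration, and the function name is hypothetical.

```python
from typing import Dict


def adjusted_category_size(
    matches: Dict[str, int],
    rule_precision: Dict[str, float],
    category_recall: float,
) -> float:
    """Scale each query's matches by its rule precision, sum them up,
    then divide by the category recall to estimate the total category size."""
    expected_correct = sum(rule_precision[query] * n for query, n in matches.items())
    return expected_correct / category_recall


rule_precision = {"linux": 0.7, "cpu": 0.9}
category_recall = 0.4

# Invented match counts: X1 = 100 for "linux", X2 = 50 for "cpu".
matches = {"linux": 100, "cpu": 50}

estimate = adjusted_category_size(matches, rule_precision, category_recall)
print(f"estimated computer docs: {estimate:.1f}")
```

With these numbers the expected correct matches are 0.7 × 100 + 0.9 × 50 = 115, and dividing by the recall of 0.4 gives roughly 287.5 estimated computer documents.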
21Initial Experiments
- Used a collection of 20,000 newsgroup articles
- Formed 5 categories
- Computers (comp.*)
- Science (sci.*)
- Hobbies (rec.*)
- Society (soc.*, alt.atheism)
- Misc (misc.sale)
- RIPPER trained with 10,000 newsgroup articles
- Classifier: 29 rules, 32 words used
- IF windows AND pc THEN Computers (precision = 0.75)
- IF satellite AND space THEN Science (precision = 0.9)
22Web-databases Probed
- Using the newsgroup classifier we probed four web databases
- Cora (www.cora.jprc.com)
- CS Papers archive (Computers)
- American Scientist (www.amsci.org)
- Science and technology magazine (Science)
- All Outdoors (www.alloutdoors.com)
- Articles about outdoor activities (Hobbies)
- Religion Today (www.religiontoday.com)
- News and discussion about religions (Society)
23Results
- Only 29 queries per web site
- No need for document retrieval!
24Conclusions
- Easy classification using only a small number of queries
- No need for document retrieval
- Only need a result like "X matches found"
- Not limited to search-only databases
- Every searchable database can be classified this way
- Not limited to topical classification
25Current Issues
- Comprehensive classification scheme
- Representative training data
26Future Work
- Use a hierarchical classification scheme
- Test different search interfaces
- Boolean model
- Vector-space model
- Different capabilities
- Compare with document sampling (Callan et al.'s work, SIGMOD'99, adapted for the classification task)
- Study classification efficiency when documents are accessible
27Related Work
- Gauch (JUCS 1996)
- Etzioni et al. (JIIS 1997)
- Hawking & Thistlewaite (TOIS 1999)
- Callan et al. (SIGMOD 1999)
- Meng et al. (CoopIS 1999)