Automatic Classification of Text Databases Through Query Probing

1
Automatic Classification of Text Databases
Through Query Probing

Panagiotis G. Ipeirotis
Luis Gravano
Columbia University
Mehran Sahami
E.piphany Inc.

2
Search-only Text Databases

Sources of valuable information
Hidden behind search interfaces
Non-crawlable
Example: Microsoft Support KB

3
Interacting With Searchable Text Databases

Searching: Metasearchers
Browsing: Use Yahoo-like directories
Browse and search: Category-enabled metasearchers

4
Searching Text Databases: Metasearchers

Select the good databases for a query
Evaluate the query at these databases
Combine the query results from the databases
Examples: MetaCrawler, SavvySearch, Profusion

5
Browsing Through Text Databases

Yahoo-like web directories
InvisibleWeb.com
SearchEngineGuide.com
TheBigHub.com
Example from InvisibleWeb.com:
Computers > Publications > ACM DL
Category-enabled metasearchers
User-defined category (e.g. Recipes)

6
Problem With Current Classification Approach

Classification of databases is done manually
This requires a lot of human effort!

7
How to Classify Text Databases Automatically: Outline

Definition of classification
Strategies for classifying searchable databases
through query probing
Initial experiments

8
Database Classification: Two Definitions

Coverage-based classification
The database contains many documents about the
category (e.g. Basketball)
Coverage = # docs about this category
Specificity-based classification
The database contains mainly documents about this
category
Specificity = # docs / |DB|

9
Database Classification: An Example

Category: Basketball
Coverage-based classification
ESPN.com, NBA.com
Specificity-based classification
NBA.com, but not ESPN.com
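
The two definitions can be illustrated with a minimal sketch. The document counts and both thresholds below are hypothetical, chosen only to reproduce the Basketball example; the slides give no actual numbers.

```python
from typing import Dict, List, Tuple


def specificity(category_docs: int, db_size: int) -> float:
    """Fraction of the database that is about the category."""
    if db_size <= 0:
        raise ValueError("db_size must be positive")
    return category_docs / db_size


# Hypothetical (basketball docs, total docs) per database.
COUNTS: Dict[str, Tuple[int, int]] = {
    "ESPN.com": (30_000, 300_000),
    "NBA.com": (20_000, 22_000),
}

# Hypothetical thresholds, not taken from the presentation.
MIN_COVERAGE = 10_000
MIN_SPECIFICITY = 0.5

# Coverage looks only at the absolute number of category documents.
coverage_based: List[str] = [
    db for db, (n, _) in COUNTS.items() if n >= MIN_COVERAGE
]

# Specificity looks at the share of the database the category occupies.
specificity_based: List[str] = [
    db for db, (n, size) in COUNTS.items()
    if specificity(n, size) >= MIN_SPECIFICITY
]
```

With these assumed counts, both sites pass the coverage test, but only NBA.com is mostly about basketball.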

10
Categorizing a Text Database: Two Problems

Find the category of a given document
Find the category of all the documents inside the
database

11
Categorizing Documents

Several text classifiers available
RIPPER (AT&T Research, William Cohen, 1995)
Input: A set of pre-classified, labeled documents
Output: A set of classification rules

12
Categorizing Documents: RIPPER

Training set: Preclassified documents
Linux as a web server → Computers
Linux vs. Windows → Computers
Jordan was the leader of Chicago Bulls → Sports
Smoking causes lung cancer → Health
Output: Rule-based classifier
IF linux THEN Computers
IF jordan AND bulls THEN Sports
IF lung AND cancer THEN Health
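
A minimal sketch of how such a rule set could be applied to a document. It only applies the three rules shown above; it does not implement RIPPER's rule learning, and the tokenizer and function names are illustrative assumptions.

```python
import re
from typing import List, Optional, Set, Tuple

# Each rule: (words that must all appear, category). Taken from the slide.
RULES: List[Tuple[Set[str], str]] = [
    ({"linux"}, "Computers"),
    ({"jordan", "bulls"}, "Sports"),
    ({"lung", "cancer"}, "Health"),
]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify(text: str) -> Optional[str]:
    """Return the category of the first rule whose words all occur."""
    words = tokenize(text)
    for required, category in RULES:
        if required <= words:  # subset test: every rule word is present
            return category
    return None  # no rule fired
```

Applied to the training examples above, `classify("Smoking causes lung cancer")` yields `"Health"`, while a document that matches no rule yields `None`.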

13
Precision and Recall of Document Classifier

During the training phase
100 documents about computers
Computer rules matched 50 docs
Of these 50 docs, 40 were about computers
Precision = 40/50 = 0.8
Recall = 40/100 = 0.4
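
The same arithmetic as a small sketch; the function and parameter names are illustrative.

```python
def precision(correct_matches: int, total_matches: int) -> float:
    """Share of matched documents that really belong to the category."""
    if total_matches <= 0:
        raise ValueError("total_matches must be positive")
    return correct_matches / total_matches


def recall(correct_matches: int, total_relevant: int) -> float:
    """Share of the category's documents that the rules matched."""
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    return correct_matches / total_relevant


# Numbers from the slide: 100 computer docs, 50 matched, 40 of them correct.
computer_precision = precision(40, 50)
computer_recall = recall(40, 100)
```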

14
From Document to Database Classification

If we know the categories of all the documents,
we are done!
But databases do not export such data!
How can we extract this information?

15
Our Approach: Query Probing

Design a small set of queries to probe the
databases
Categorize the database based on the probing
results

16
Designing and Implementing Query Probes

The probes should extract information about the
categories of the documents in the database
Start with a document classifier (RIPPER)
Transform each rule into a query
IF lung AND cancer THEN Health → lung AND cancer
IF linux THEN Computers → linux
Get number of matches for each query
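
The steps above can be sketched as follows. The `AND` query syntax and the `count_matches` callback are assumptions: a real database would need its own query syntax and a wrapper around its search interface. The toy in-memory "database" exists only to make the sketch runnable.

```python
from typing import Callable, List, Tuple

Rule = Tuple[Tuple[str, ...], str]  # (rule words, category)

RULES: List[Rule] = [
    (("lung", "cancer"), "Health"),
    (("linux",), "Computers"),
]


def rule_to_query(words: Tuple[str, ...]) -> str:
    """Turn the words of a rule into a conjunctive query string."""
    return " AND ".join(words)


def probe(rules: List[Rule],
          count_matches: Callable[[str], int]) -> List[Tuple[str, str, int]]:
    """Send one query per rule; keep only the reported match counts."""
    results = []
    for words, category in rules:
        query = rule_to_query(words)
        results.append((category, query, count_matches(query)))
    return results


# Toy stand-in for a searchable database (hypothetical documents).
TOY_DOCS = [
    "linux kernel release notes",
    "lung cancer screening study",
    "cancer research funding",
    "installing linux on a laptop",
]


def toy_count_matches(query: str) -> int:
    """Count toy documents that contain every query term."""
    terms = query.lower().split(" and ")
    return sum(all(t in doc.split() for t in terms) for doc in TOY_DOCS)
```

Note that `probe` never retrieves documents: it only needs the number of matches each query reports.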

17
Three Categories and Three Databases

Databases: ACM DL, NBA.com, PubMed
Probes:
linux → Computers
jordan AND bulls → Sports
lung AND cancer → Health

18
Using the Results for Classification

We use the results to estimate coverage and
specificity values
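
One way this estimation could look, as a hedged sketch. The match counts are hypothetical. Because a search-only database does not report its size, the sketch assumes |DB| can be approximated by the sum of the per-category estimates; that normalization is an assumption of this example, not something the slides state.

```python
from typing import Dict, List, Tuple


def estimate(match_counts: Dict[str, List[int]]
             ) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Estimate coverage and specificity from per-category probe counts."""
    # Coverage estimate: total matches returned by the category's probes.
    coverage = {cat: sum(counts) for cat, counts in match_counts.items()}
    # Assumed stand-in for |DB|: the sum of all coverage estimates.
    total = sum(coverage.values())
    specificity = {
        cat: (n / total if total else 0.0) for cat, n in coverage.items()
    }
    return coverage, specificity


# Hypothetical match counts, one list entry per probe query.
acm_like_counts = {
    "Computers": [900, 100],
    "Sports": [10],
    "Health": [40],
}
coverage_est, specificity_est = estimate(acm_like_counts)
best_category = max(specificity_est, key=specificity_est.get)
```

With these assumed counts the Computers category dominates both estimates, which is what one would expect from a database such as ACM DL.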

19
Adjusting Query Results

Classifiers are not perfect!
Queries do not retrieve all the documents that
belong to a category
Queries for one category match documents that
do not belong to this category
From the training phase of classifier we use
precision and recall

20
Precision and Recall Adjustment

Computer category
Rule linux: precision = 0.7
Rule cpu: precision = 0.9
Recall (for all the rules) = 0.4
Probing with queries for Computers
Query linux → X1 matches → 0.7 × X1 correct matches
Query cpu → X2 matches → 0.9 × X2 correct matches
From the X1 + X2 documents found:
Expect 0.7 × X1 + 0.9 × X2 to be correct
Expect (0.7 × X1 + 0.9 × X2) / 0.4 total computer docs
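
A minimal sketch of this adjustment. The precision and recall values come from the slide; the match counts X1 = 100 and X2 = 200 are hypothetical, and the function name is illustrative.

```python
from typing import List, Tuple


def adjusted_category_size(rule_matches: List[Tuple[float, int]],
                           recall: float) -> float:
    """Estimate the number of category documents in the database.

    rule_matches holds one (rule precision, reported match count) pair
    per probe query; recall is the classifier's recall for the category.
    """
    if not 0.0 < recall <= 1.0:
        raise ValueError("recall must be in (0, 1]")
    # Discount each probe's matches by the precision of its rule.
    expected_correct = sum(p * matches for p, matches in rule_matches)
    # Scale up to account for category documents that no rule retrieves.
    return expected_correct / recall


# Hypothetical match counts: X1 = 100 for "linux", X2 = 200 for "cpu".
estimated_computer_docs = adjusted_category_size(
    [(0.7, 100), (0.9, 200)], recall=0.4
)
```

Here 0.7 × 100 + 0.9 × 200 = 250 matches are expected to be correct, so the estimate is about 250 / 0.4 = 625 computer documents.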

21
Initial Experiments

Used a collection of 20,000 newsgroup articles
Formed 5 categories
Computers (comp.*)
Science (sci.*)
Hobbies (rec.*)
Society (soc.*, alt.atheism)
Misc (misc.sale)
RIPPER trained with 10,000 newsgroup articles
Classifier: 29 rules, 32 words used
IF windows AND pc THEN Computers (precision = 0.75)
IF satellite AND space THEN Science (precision = 0.9)

22
Web-databases Probed

Using the newsgroup classifier we probed four web
databases
Cora (www.cora.jprc.com): CS papers archive (Computers)
American Scientist (www.amsci.org): Science and technology magazine (Science)
All Outdoors (www.alloutdoors.com): Articles about outdoor activities (Hobbies)
Religion Today (www.religiontoday.com): News and discussion about religions (Society)

23
Results

Only 29 queries per web site
No need for document retrieval!

24
Conclusions

Easy classification using only a small number of
queries
No need for document retrieval
Only need a result like "X matches found"
Not limited to search-only databases
Every searchable database can be classified this
way
Not limited to topical classification
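
Since only a line like "X matches found" is needed, extracting the count can be as simple as the sketch below. The exact wording of the result line is an assumption: each real search interface phrases it differently and would need its own pattern.

```python
import re
from typing import Optional

# Assumed result-line format: "<number> match(es) found", with optional
# thousands separators in the number.
MATCH_COUNT = re.compile(r"(\d[\d,]*)\s+match(?:es)?\s+found", re.IGNORECASE)


def parse_match_count(result_page: str) -> Optional[int]:
    """Return the reported number of matches, or None if not found."""
    found = MATCH_COUNT.search(result_page)
    if found is None:
        return None
    return int(found.group(1).replace(",", ""))
```

For example, `parse_match_count("1,234 matches found")` returns `1234`, and a page without such a line returns `None`.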

25
Current Issues

Comprehensive classification scheme
Representative training data

26
Future Work

Use a hierarchical classification scheme
Test different search interfaces
Boolean model
Vector-space model
Different capabilities
Compare with document sampling (Callan et al.'s
work, SIGMOD'99, adapted for the classification
task)
Study classification efficiency when documents
are accessible

27
Related Work