Probe, Count, and Classify: Categorizing Hidden Web Databases

Transcript and Presenter's Notes
1
Probe, Count, and Classify: Categorizing Hidden Web Databases
  • Panagiotis G. Ipeirotis
  • Luis Gravano
  • Columbia University
  • Mehran Sahami
  • E.piphany Inc.

DIMACS Summer School Tutorial on New Frontiers in Data Mining, Theme: Web -- Thursday, August 16, 2001
2
Surface Web vs. Hidden Web
  • Surface Web
  • Link structure
  • Crawlable
  • Hidden Web
  • No link structure
  • Documents hidden behind search forms

3
Do We Need the Hidden Web?
  • Example: PubMed/MEDLINE
  • PubMed (www.ncbi.nlm.nih.gov/PubMed), search: cancer
  • → 1,341,586 matches
  • AltaVista: cancer site:www.ncbi.nlm.nih.gov
  • → 21,830 matches

4
Interacting With Searchable Text Databases
  • Searching: Metasearchers
  • Browsing: Yahoo!-like web directories
  • InvisibleWeb.com
  • SearchEngineGuide.com
  • Example from InvisibleWeb.com:
  • Health > Publications > PubMed

Created Manually!
5
Classifying Text Databases Automatically: Outline
  • Definition of classification
  • Classification through query probing
  • Experiments

6
Database Classification: Two Definitions
  • Coverage-based classification:
  • Database contains many documents about a category
  • Coverage: number of documents about this category
  • Specificity-based classification:
  • Database contains mainly documents about a category
  • Specificity: fraction of the database's documents that are about this category

Both measures are sketched in code below.
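A minimal Python sketch of the two measures, assuming (hypothetically) that we could read each document's true category label; real hidden-web databases expose no such data, which is exactly the problem addressed next:

    # Minimal sketch of Coverage and Specificity. Assumes a
    # hypothetical flat list of per-document category labels.

    def coverage(doc_categories, category):
        """Absolute number of documents about the category."""
        return sum(1 for c in doc_categories if c == category)

    def specificity(doc_categories, category):
        """Fraction of the database's documents about the category."""
        return coverage(doc_categories, category) / len(doc_categories)

    docs = ["Basketball"] * 8 + ["Health"] * 2
    print(coverage(docs, "Basketball"))     # 8
    print(specificity(docs, "Basketball"))  # 0.8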

7
Database Classification: An Example
  • Category: Basketball
  • Coverage-based classification:
  • ESPN.com, NBA.com, but not KnicksTerritory.com
  • Specificity-based classification:
  • NBA.com, KnicksTerritory.com, but not ESPN.com

8
Database Classification: More Details
  • Thresholds for coverage and specificity:
  • Tc: coverage threshold (e.g., 100)
  • Ts: specificity threshold (e.g., 0.5)

Tc, Ts: editorial choices
  • Ideal(D): the set of classes for database D (sketched in code below)
  • Class C is in Ideal(D) if:
  • D has enough coverage and specificity (Tc, Ts) for C and all of C's ancestors,
  • and
  • D fails to have both enough coverage and specificity for each child of C

[Figure: example hierarchy. Root → Sports (Coverage 800, Specificity 0.8) and Health (Coverage 200, Specificity 0.2); Sports → Baseball (Specificity 0.5) and Basketball (Specificity 0.5)]
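A Python sketch of this definition over the example hierarchy in the figure; the dict encoding, thresholds, and document mix are illustrative only. Note that a document about a category also counts toward every ancestor category:

    # Sketch of Ideal(D); hierarchy encoding and thresholds illustrative.
    T_C, T_S = 100, 0.5

    children = {"Root": ["Sports", "Health"],
                "Sports": ["Baseball", "Basketball"],
                "Health": [], "Baseball": [], "Basketball": []}

    def descendants(node):
        """The node together with everything below it."""
        out = {node}
        for child in children[node]:
            out |= descendants(child)
        return out

    def qualifies(doc_cats, node):
        """Coverage >= T_C and Specificity >= T_S for this node."""
        n = sum(1 for c in doc_cats if c in descendants(node))
        return n >= T_C and n / len(doc_cats) >= T_S

    def ideal(doc_cats, node="Root"):
        """Deepest qualifying class along each path from the root."""
        good = [c for c in children[node] if qualifies(doc_cats, c)]
        if not good:                   # no child qualifies: stop here
            return {node}
        return set().union(*(ideal(doc_cats, c) for c in good))

    docs = ["Basketball"] * 600 + ["Baseball"] * 200 + ["Health"] * 200
    print(ideal(docs))  # {'Basketball'}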
9
From Document to Database Classification
  • If we know the categories of all documents inside
    the database, we are done!
  • We do not have direct access to the documents.
  • Databases do not export such data!
  • How can we extract this information?

10
Our Approach: Query Probing
  • Train a rule-based document classifier.
  • Transform classifier rules into queries.
  • Adaptively send queries to databases.
  • Categorize the databases based on the adjusted number of query matches.

11
Training a Rule-based Document Classifier
  • Feature selection: Zipf's law pruning, followed by information-theoretic feature selection [Koller & Sahami 1996]
  • Classifier learning: AT&T's RIPPER [Cohen 1995]
  • Input: a set of pre-classified, labeled documents
  • Output: a set of classification rules
  • IF linux THEN Computers
  • IF jordan AND bulls THEN Sports
  • IF lung AND cancer THEN Health

12
Constructing Query Probes
  • Transform each rule into a query (sketched below):
  • IF lung AND cancer THEN Health → lung AND cancer
  • IF linux THEN Computers → linux
  • Send the queries to the database
  • Get the number of matches for each query, NOT the documents (i.e., the number of documents that match each rule)
  • These documents would have been classified by the rule under its associated category!
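A Python sketch of this step; count_matches is a hypothetical stand-in for submitting a query through the database's search form and reading off the reported hit count:

    # Sketch: turn classification rules into query probes and collect
    # only the number of matches per category.

    RULES = [({"lung", "cancer"}, "Health"),
             ({"linux"}, "Computers"),
             ({"jordan", "bulls"}, "Sports")]

    def rule_to_query(terms):
        return " AND ".join(sorted(terms))

    def count_matches(query, database):
        """Hypothetical stand-in: count docs containing all query terms."""
        terms = query.split(" AND ")
        return sum(all(t in doc for t in terms) for doc in database)

    def probe(database):
        counts = {}
        for terms, category in RULES:
            query = rule_to_query(terms)
            counts[category] = counts.get(category, 0) + count_matches(query, database)
        return counts

    db = [{"lung", "cancer", "smoking"}, {"linux", "kernel"},
          {"jordan", "bulls", "nba"}]
    print(probe(db))  # {'Health': 1, 'Computers': 1, 'Sports': 1}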

13
Adjusting Query Results
  • Classifiers are not perfect!
  • Queries do not retrieve all the documents in a
    category
  • Queries for one category match documents not in that category
  • From the classifier's training phase we know its confusion matrix

14
Confusion Matrix
[Figure: confusion matrix M; columns = correct class, rows = class the documents were classified into]

Example: 10% of Sports documents are classified as Computers, i.e., 10% of the 5,000 Sports docs go to Computers.

M · Coverage(D) = ECoverage(D)
15
Confusion Matrix Adjustment: Compensating for the Classifier's Errors
M is diagonally dominant, hence invertible:

Coverage(D) ≈ M⁻¹ · ECoverage(D)

Multiplying the probe results by M⁻¹ better approximates the correct coverage.
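A minimal numpy sketch of the adjustment; the matrix entries and probe counts below are invented for illustration:

    import numpy as np

    # M[i][j]: fraction of true class-j documents that the classifier
    # assigns to class i (illustrative values; columns sum to 1).
    M = np.array([[0.80, 0.05, 0.10],   # assigned to Computers
                  [0.10, 0.85, 0.05],   # assigned to Health
                  [0.10, 0.10, 0.85]])  # assigned to Sports
    # columns: true Computers, Health, Sports

    # Match counts observed through query probing:
    ecoverage = np.array([950.0, 500.0, 4550.0])

    # M is diagonally dominant, hence invertible:
    coverage = np.linalg.inv(M) @ ecoverage
    print(coverage)  # adjusted estimate of the true per-class counts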
16
Classifying a Database
  • Send the query probes for the top-level categories
  • Get the number of matches for each probe
  • Calculate Specificity and Coverage for each category
  • Push the database to the qualifying categories (with Specificity > Ts and Coverage > Tc)
  • Repeat for each of the qualifying categories
  • Return the classes that satisfy the coverage/specificity conditions
  • The result is the approximation of the Ideal classification (sketched below)
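A Python sketch of this top-down loop, reusing the illustrative children hierarchy from the Ideal(D) sketch. probe_counts is a hypothetical helper that sends the probes for the given categories and returns confusion-matrix-adjusted match counts; using the sum of sibling counts as the specificity denominator is one plausible choice, since hidden databases rarely report their size:

    # Sketch of the top-down classification loop; probe_counts is a
    # hypothetical helper returning adjusted match counts per category.

    def classify(db, node="Root", T_c=100, T_s=0.5):
        """Approximate Ideal(db) by recursive query probing."""
        kids = children[node]
        if not kids:
            return {node}
        counts = probe_counts(db, kids)    # adjusted matches per child
        total = sum(counts.values()) or 1  # estimate of docs under node
        good = [c for c in kids
                if counts[c] >= T_c and counts[c] / total >= T_s]
        if not good:                       # nothing qualifies: stop here
            return {node}
        return set().union(*(classify(db, c, T_c, T_s) for c in good))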

17
Real Example: ACM Digital Library (Tc = 100, Ts = 0.5)
18
Experiments: Data
  • 72-node 4-level topic hierarchy from
    InvisibleWeb/Yahoo! (54 leaf nodes)
  • 500,000 Usenet articles (April-May 2000)
  • Newsgroups assigned by hand to hierarchy nodes
  • RIPPER trained with 54,000 articles (1,000
    articles per leaf)
  • 27,000 articles used to estimate the confusion matrices
  • Remaining 419,000 articles used to build 500 Controlled Databases of varying category mixes and sizes

19
Comparison With Alternatives
  • DS: Random sampling of documents via query probes [Callan et al., SIGMOD 1999]
  • Different task: gather vocabulary statistics
  • We adapted it for database classification
  • TQ: Title-based probing [Yu et al., WISE 2000]
  • Query probes are simply the category names

20
Experiments: Metrics
  • Accuracy of classification results (sketched in code below):
  • Expanded(N): N and all its descendants
  • Correct = Expanded(Ideal(D))
  • Classified = Expanded(Approximate(D))
  • Precision = |Correct ∩ Classified| / |Classified|
  • Recall = |Correct ∩ Classified| / |Correct|
  • F-measure = 2 · Precision · Recall / (Precision + Recall)
  • Cost of classification: number of queries sent to the database
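The same metrics in Python, reusing descendants() from the Ideal(D) sketch:

    # Sketch of the evaluation metrics over expanded category sets.

    def expanded(nodes):
        """The nodes plus all of their descendants."""
        return set().union(*(descendants(n) for n in nodes)) if nodes else set()

    def f_measure(ideal_classes, approx_classes):
        correct = expanded(ideal_classes)
        classified = expanded(approx_classes)
        hits = correct & classified
        if not hits:
            return 0.0
        precision = len(hits) / len(classified)
        recall = len(hits) / len(correct)
        return 2 * precision * recall / (precision + recall)

    # Ideal(D) = {Sports}, approximation = {Basketball}:
    print(f_measure({"Sports"}, {"Basketball"}))  # 0.5 in the toy hierarchy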
21
Average F-measure, Controlled Databases
PnC = Probe and Count, DS = Document Sampling, TQ = Title-based Probing
22
Experimental Results: Controlled Databases
  • Feature selection helps.
  • Confusion-matrix adjustment helps.
  • F-measure above 0.8 for most <Tc, Ts> combinations.
  • Results degrade gracefully with hierarchy depth.
  • Relatively small number of probes needed for most <Tc, Ts> combinations tried.
  • Also, probes are short: 1.5 words on average, 4 words maximum.
  • Both better performance and lower cost than DS (the Callan et al. adaptation) and TQ (Yu et al.).

23
Web Databases
  • 130 real databases classified from InvisibleWeb.
  • Used InvisibleWeb's categorization as the ground truth.
  • Simple wrappers for querying (only the number of matches is needed).
  • The Ts, Tc thresholds are not known (unlike with the Controlled databases) but are implicit in the InvisibleWeb categorization.
  • We can learn/validate the thresholds (tricky but easy!), as sketched below.
  • More details in the paper!
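One plausible way to learn the thresholds, sketched as a grid search that maximizes the average F-measure over databases with known InvisibleWeb categories; the grid values are invented, and the paper's exact procedure (3-fold cross-validation, see below) may differ:

    # Sketch: learn <T_c, T_s> by grid search against databases whose
    # correct categories are known, reusing classify() and f_measure().

    def learn_thresholds(training_dbs):
        """training_dbs: list of (database, correct_class_set) pairs."""
        best, best_f = None, -1.0
        for T_c in (1, 2, 4, 8, 16, 32, 64, 128, 256):
            for T_s in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
                avg_f = sum(f_measure(truth, classify(db, T_c=T_c, T_s=T_s))
                            for db, truth in training_dbs) / len(training_dbs)
                if avg_f > best_f:
                    best, best_f = (T_c, T_s), avg_f
        return best, best_f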

24
Web Databases: Learning Thresholds
25
Experimental Results: Web Databases
  • 130 real Web databases.
  • F-measure above 0.7 for the best <Tc, Ts> combination learned.
  • 185 query probes per database needed on average for classification.
  • Also, probes are short: 1.5 words on average, 4 words maximum.

26
Conclusions
  • Accurate classification using only a small number
    of short queries
  • No need for document retrieval
  • Only need a result like "X matches found"
  • No need for any cooperation or special metadata
    from databases
  • URL: http://qprober.cs.columbia.edu

27
Current and Future Work
  • Build wrappers automatically
  • Extend to non-topical categories
  • Evaluate impact of varying search interfaces
    (e.g., Boolean vs. ranked)
  • Extend to other classifiers (e.g., SVMs or
    Bayesian models)
  • Integrate with searching (connection with
    database selection?)

28
Questions?
29
Contributions
  • Easy, inexpensive method for database
    classification
  • Uses results from document classification
  • Indirect classification of the documents in a
    database
  • Does not inspect documents, only number of
    matches
  • Adjustment of results according to the classifier's performance
  • Easy wrapper construction
  • No need for any metadata from the database

30
Related Work
  • Callan et al., SIGMOD 1999
  • Gauch et al., Profusion
  • Dolin et al., Pharos
  • Yu et al., WISE 2000
  • Raghavan and Garcia-Molina, VLDB 2001

31
Controlled Databases
  • 500 databases built using 419,000 newsgroup
    articles
  • One label per document
  • 350 databases with a single (not necessarily leaf) category
  • 150 databases with varying category mixes
  • Database size ranges from 25 to 25,000 articles
  • Indexed and queried using SMART

32
F-measure for Different Hierarchy Depths
PnC = Probe and Count, DS = Document Sampling, TQ = Title-based Probing (Tc = 8, Ts = 0.3)
33
Query Probes Per Controlled Database
34
Web Databases Number of Query Probes
35
3-fold Cross-validation
36
Real Confusion Matrix for Top Node of Hierarchy