Title: Automatic Categorization of Query Results
1. Automatic Categorization of Query Results
- Kaushik Chakrabarti, Surajit Chaudhuri, Seung-won Hwang
- Data Management, Exploration and Mining Group
- Microsoft Research
- SIGMOD 2004
2. Exploratory Queries
Database systems are increasingly used for interactive, exploratory retrieval:
Home Searching Application
Job Searching Application
Keyword Search over Knowledge Bases
Product Catalog Searching
3. Information Overload
- In exploratory retrieval, queries can return too many answers; only a tiny fraction of the retrieved items is relevant to the user, and finding those items wastes time and effort
- Causes:
- The user is not certain of what she is looking for (e.g., looking for medium-priced homes in the Seattle area but unable to specify an exact price or neighborhood)
- The user is naive and refrains from using advanced search features, browsing instead of searching
- Manual query reformulation is difficult and time-consuming
4. Reducing Information Overload
- Internet text search:
- Categorization: group search results into labeled categories; the user explores just the relevant categories and ignores the rest
- Ranking: place the most relevant answers at the top; the user examines the first 10-20 results and ignores the rest
- Database systems:
- Ranking: has received some attention recently (in the last 3-4 years)
- Categorization: the focus of this paper
5. Problems with Existing Techniques
- The category structure is created a priori (typically a manual process)
- Items are tagged (assigned categories) a priori (manually or semi-automatically)
- At search time, each search result is placed under its pre-assigned category
- Susceptible to skew, which defeats the purpose of categorization
6. Our Solution
- Categorize the results of SQL queries automatically
- Generate a labeled, hierarchical category structure dynamically, based on the contents of the tuples in the result set
- Does not suffer from the problems of a-priori categorization
7. Outline of Talk
- Exploration/cost models to quantify the information overload faced by a user during an exploration
- Heuristic algorithms to find good (low-cost) categorizations
- Experiments to evaluate the models and algorithms
8. Hierarchical Categorization
Recursive partitioning of the tuples in the result set, based on data attributes and their values.
Example query: find homes in the Seattle/Bellevue area of Washington, USA in the 200K-300K price range.
9. Which Categorization Is the Best?
- There are many possible categorizations for a given result set: at each level, any distinct choice of categorizing attribute and partitioning generates a different categorization
- We want to choose the categorization that minimizes the information overload on the user
- To quantify information overload, we first need a model that captures how a user navigates the result set using a given tree
10. Exploration Model
- Defines how the user navigates the result set using a given tree T
- Two scenarios:
- ALL scenario: the user explores the result set using the category tree until she finds every tuple relevant to her
- ONE/FEW scenario: the user explores the result set until she finds one (or a few) relevant tuples
Given that the user has decided to explore a category C, how does she explore C?
- Option SHOWTUPLES: browse through all tuples in tset(C)
- Option SHOWCAT: examine the labels of all subcategories C1, ..., Cn, and for each Ci either explore it or ignore it
11. An Example Exploration
(Figure: the entire category tree, and one exploration on that tree; starting from ALL at the root, each node is marked with the user's choice: IGNORE, SHOWCAT, or SHOWTUPLES)
12. Cost of an Exploration
Information Overload Cost (or simply Cost): Cost(X,T) of a given user exploration X using a given category tree T is the total number of items (both category labels and data tuples) examined by the user during X.
(Figure: the example exploration, with each node marked IGNORE, SHOWCAT, or SHOWTUPLES, starting from ALL at the root)
Say tset(C) for the SHOWTUPLES category contains 20 tuples; then Cost(X,T) = 1 + 3 + 3 + 20 = 27.
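The cost definition above can be sketched in code. This is an illustrative sketch, not the paper's implementation; the `Node` class and field names are assumptions. It walks a category tree with a fixed set of per-node choices and counts the labels and tuples the user examines:

```python
# Hypothetical tree representation: each category knows its subcategories and
# |tset(C)|, the number of tuples it covers.
class Node:
    def __init__(self, label, children=(), tset_size=0):
        self.label = label
        self.children = list(children)
        self.tset_size = tset_size

def exploration_cost(node, choice):
    """Items examined while exploring `node` (whose label is already read).

    `choice` maps each label to 'IGNORE', 'SHOWTUPLES', or 'SHOWCAT'.
    """
    if choice[node.label] == "SHOWTUPLES":
        return node.tset_size                  # read every tuple in tset(C)
    cost = len(node.children)                  # examine all subcategory labels
    for c in node.children:
        if choice[c.label] != "IGNORE":
            cost += exploration_cost(c, choice)
    return cost

# A tree shaped like the slide's example: SHOWCAT at the root and at one
# child, SHOWTUPLES at one grandchild with |tset(C)| = 20, IGNORE elsewhere.
root = Node("ALL", [
    Node("A", [Node("A1"), Node("A2", tset_size=20), Node("A3")]),
    Node("B"), Node("C")])
choice = {"ALL": "SHOWCAT", "A": "SHOWCAT", "B": "IGNORE", "C": "IGNORE",
          "A1": "IGNORE", "A2": "SHOWTUPLES", "A3": "IGNORE"}
total = 1 + exploration_cost(root, choice)     # +1 for the root's own label
print(total)                                   # 1 + 3 + 3 + 20 = 27
```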
13. Average Cost
- If we knew the user's exploration X in advance, we could generate the tree T that minimizes the cost Cost(X,T) for that exploration
- Since we do not, we use aggregate knowledge of previous user behavior to estimate the cost Cost(T) that a user will face, on average, during an exploration using a given category tree T
- Cost(T) is the number of items (category labels and data tuples) that a user needs to examine, on average, during an exploration of the result set using T
- Categorization problem: find the tree Tmin = argmin_T Cost(T)
14. Computing the Average Cost
- We need two probabilities associated with each node C:
- Exploration probability P(C) = P(user explores C | user examines the label of C)
- SHOWTUPLES probability Pw(C) = P(user chooses SHOWTUPLES for C | user explores C)
- Consider a non-leaf node C of T with subcategories C1, ..., Cn
- Let Cost(C) denote the cost of exploring the subtree rooted at C, given that the user has chosen to explore C
- Cost(C) = P(SHOWTUPLES) * (SHOWTUPLES cost) + P(SHOWCAT) * (SHOWCAT cost)
  = Pw(C) * |tset(C)| + (1 - Pw(C)) * (K*n + Σi P(Ci) * Cost(Ci)), where K is the cost of examining one category label
- Cost(T) is simply Cost(root)
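A direct transcription of this recurrence, as a sketch: the `Cat` class, its field names, the example probabilities, and K = 1 are all assumptions for illustration.

```python
from dataclasses import dataclass, field

K = 1  # assumed cost of examining a single category label

@dataclass
class Cat:
    tset_size: int                  # |tset(C)|
    pw: float = 1.0                 # Pw(C): P(SHOWTUPLES | user explores C)
    p_explore: float = 0.0          # P(C): P(explore C | label of C examined)
    children: list = field(default_factory=list)

def avg_cost(c):
    """Cost(C) = Pw(C)*|tset(C)| + (1-Pw(C))*(K*n + sum_i P(Ci)*Cost(Ci))."""
    if not c.children:              # leaf: the only option is SHOWTUPLES
        return c.tset_size
    showcat = K * len(c.children) + sum(
        ch.p_explore * avg_cost(ch) for ch in c.children)
    return c.pw * c.tset_size + (1 - c.pw) * showcat

# Made-up two-category tree; Cost(T) is Cost(root).
root = Cat(tset_size=30, pw=0.2,
           children=[Cat(20, p_explore=0.5), Cat(10, p_explore=0.1)])
cost_T = avg_cost(root)  # = 0.2*30 + 0.8*(2 + 0.5*20 + 0.1*10) = 16.4
```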
15. Estimating the Probabilities
- We use aggregate knowledge of previous user behavior
- Specifically, we look at the log of queries (the workload) that users of this application have asked in the past, which is easy to obtain
16. Estimating the SHOWTUPLES Probability
SHOWTUPLES probability of C: Pw(C) = P(user chooses SHOWTUPLES for C | user explores C).
Given that the user explores C, she has two choices:
- SHOWCAT: chosen if the subcategorizing attribute SA(C) is such that she is interested in only a few subcategories (i.e., few values of SA(C))
- SHOWTUPLES: chosen if she is interested in all or most of the subcategories (i.e., all or most values of SA(C))
17. Estimating the Exploration Probability
- Exploration probability P(C) = P(user explores C | user examines the label of C)
18. Estimating the Exploration Probability (contd.)
- Assume users' interests in different attributes are mutually independent
- A workload query Wi represents the information need of its user Ui
- User Ui is interested in a label such as "Price: 200K-225K" iff Wi contains a selection condition on Price and that condition overlaps with the label predicate 200K-225K
- P(C) is then estimated from the fraction of workload queries whose users are interested in the label of C
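The overlap test can be sketched as follows. This is an assumed representation, not the paper's code: each workload query is a dict mapping an attribute to the (lo, hi) range of its selection condition.

```python
def overlaps(cond, label):
    """Two ranges (lo, hi) overlap iff their intersection is non-empty."""
    return cond[0] < label[1] and label[0] < cond[1]

def explore_probability(attr, label_range, workload):
    """Fraction of workload queries interested in the label (attr, range)."""
    interested = sum(1 for q in workload
                     if attr in q and overlaps(q[attr], label_range))
    return interested / len(workload)

# Three hypothetical buyers' queries; only the first has a Price condition
# overlapping the label predicate 200K-225K.
workload = [{"Price": (210_000, 250_000)},
            {"Price": (150_000, 190_000)},
            {"Bedrooms": (3, 4)}]
p = explore_probability("Price", (200_000, 225_000), workload)
print(round(p, 3))  # 0.333
```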
19. Outline of Talk
- Exploration/cost models to quantify the information overload faced by a user during an exploration
- Heuristic algorithms to find good (low-cost) categorizations
- Experiments to evaluate the models and algorithms
20. Categorization Algorithm
- Categorization problem: find the tree Tmin = argmin_T Cost(T)
- An enumerative algorithm would produce the optimal tree but is prohibitively expensive
- Heuristic techniques to reduce the search space (1-level case first):
- Eliminate unattractive attributes without considering any of their partitionings
- For each attribute retained, obtain a good partitioning efficiently (without enumerating all possible partitionings)
- Choose the attribute-partitioning combination with the least cost
21. Attribute Elimination
- Attributes that occur infrequently in selection conditions in the workload queries can be discarded right away:
- If the categorizing attribute is low-occurring, the SHOWTUPLES probability is high
- Typically, SHOWTUPLES cost >> SHOWCAT cost
- So a low-occurring categorizing attribute means high cost, irrespective of the partitioning
- Eliminate attributes whose occurrence fraction is < x (a threshold)
22. Partitioning for Categorical Attributes
- We consider only single-value partitionings
- So we only need to find the right ordering of the categories
- Among all orderings, cost is minimized when the categories are presented in increasing order of Cost(Ci)/P(Ci)
- Intuition:
- If the user is likely to drill into a category, present it earlier
- If a category's exploration cost is low, place it earlier
- Cost(Ci) is complex to compute in the multilevel case
- So we use the following simple heuristic: present the categories in decreasing order of P(Ci)
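Both orderings from this slide can be shown side by side; the P(Ci) and Cost(Ci) values below are invented for illustration.

```python
# The cost-optimal rule sorts categories by increasing Cost(Ci)/P(Ci);
# the multilevel heuristic simply sorts by decreasing P(Ci).
cats = {"Seattle":  {"p": 0.6, "cost": 30},
        "Bellevue": {"p": 0.3, "cost": 6},
        "Redmond":  {"p": 0.1, "cost": 4}}

optimal = sorted(cats, key=lambda c: cats[c]["cost"] / cats[c]["p"])
heuristic = sorted(cats, key=lambda c: cats[c]["p"], reverse=True)

print(optimal)    # ['Bellevue', 'Redmond', 'Seattle']  (ratios 20 < 40 < 50)
print(heuristic)  # ['Seattle', 'Bellevue', 'Redmond']
```

Note that the two rules can disagree: Seattle has the highest drill-in probability but also the highest exploration cost, so the optimal ordering pushes it last while the heuristic puts it first.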
23. Partitioning for Numeric Attributes
- Consider the simple case where we want to partition the range (vmin, vmax] into two buckets
- We want to identify the best point at which to split
- If many query ranges begin or end at v, then v is a good point to split
- Goodness score of a split point v = (number of workload ranges starting at v) + (number of workload ranges ending at v)
- Heuristic: choose the (m-1) best split points (based on goodness scores) needed for an m-bucket partitioning
- The above algorithm produces the cost-optimal partitioning in some special cases, and good partitionings in general
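The goodness-score heuristic can be sketched as below; the workload ranges are made-up values for illustration.

```python
from collections import Counter

def best_splitpoints(ranges, m):
    """Pick the (m-1) highest-scoring split points for an m-bucket partition."""
    score = Counter()
    for start, end in ranges:
        score[start] += 1       # a range beginning at v makes v a good split
        score[end] += 1         # so does a range ending at v
    return sorted(v for v, _ in score.most_common(m - 1))

# Four hypothetical price ranges from the workload (in thousands): 200 scores
# 3 (it starts three ranges), making it the strongest split point.
ranges = [(200, 250), (200, 300), (250, 300), (200, 400)]
pts = best_splitpoints(ranges, m=3)   # two split points -> three buckets
print(pts)
```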
24. Multilevel Categorization
- At each level l:
- Determine the categorizing attribute A
- For each category C at level (l-1) with |tset(C)| > M, partition C using A
- Detailed algorithm in the paper
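The level-by-level construction can be sketched with stubbed choices: here the categorizing attribute per level is simply taken in a fixed order and categorical single-value partitioning is used, whereas the paper chooses both with the cost model; M and the sample data are assumptions.

```python
M = 2  # assumed threshold: categories with at most M tuples are not subdivided

def build_tree(tuples, attributes):
    if len(tuples) <= M or not attributes:
        return {"tuples": tuples}                # leaf: shown via SHOWTUPLES
    attr, rest = attributes[0], attributes[1:]   # stub for attribute selection
    groups = {}
    for t in tuples:
        groups.setdefault(t[attr], []).append(t)
    return {attr: {value: build_tree(group, rest)
                   for value, group in groups.items()}}

homes = [{"City": "Seattle", "Beds": 2}, {"City": "Seattle", "Beds": 3},
         {"City": "Seattle", "Beds": 3}, {"City": "Bellevue", "Beds": 2}]
tree = build_tree(homes, ["City", "Beds"])
# Seattle (3 tuples > M) is subdivided by Beds; Bellevue (1 tuple) is a leaf.
```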
25. Outline of Talk
- Exploration/cost models to quantify the information overload faced by a user during an exploration
- Heuristic algorithms to find good (low-cost) categorizations
- Experiments to evaluate the models and algorithms
26. Experiments
- Goals:
- Evaluate the accuracy of our cost models
- Evaluate our cost-based categorization algorithm and compare it with techniques that do not consider such cost models
- Dataset:
- Home listing database (1.7 million homes) from MSN House&Home
- Attributes: neighborhood, price, bedroom-count, bath-count, year-built, property-type, square-footage
- Workload:
- 176,262 SQL query strings representing searches conducted by real buyers on the MSN House&Home web site
- Two studies:
- A large-scale, simulated, cross-validated user study
- A real-life user study
- Both compare our cost-based algorithm to the No-Cost and Attr-Cost techniques
27. Large-scale Simulated User Study
- Treat a workload query as a user exploration
- Broaden the exploration to generate the user query for which the tree is generated
- 8-fold cross-validation (each subset has 100 queries)
- Results: a strong positive correlation between the actual cost and the estimated average cost
- The cost-based techniques are about 5 times better than the other techniques
(Figure: cost of the various techniques)
28. Real-life User Study
- 11 users (FTEs and summer interns of our group, August 2003)
- 4 search tasks, 3 categorization techniques, cross-balanced across users (each user did a task only once, using one of the categorization techniques)
- Users need to examine 5-10 items per relevant tuple found
(Figure: average normalized cost, i.e., items examined by the user per relevant tuple found, of the various techniques)
29. Summary
- Proposed automatically generating a labeled, hierarchical category structure to reduce information overload
- Developed analytical models to quantify information overload
- The models accurately capture information overload in real life (90% correlation)
- Based on those models, developed algorithms to generate the tree that minimizes information overload
- Our cost-based technique outperforms the other techniques by 1-2 orders of magnitude, significantly reducing the problem of information overload

Thank you!