Title: Automatic Categorization of Query Results
Slide 1: Automatic Categorization of Query Results
- Kaushik Chakrabarti, Surajit Chaudhuri, Seung-won Hwang
- Data Management, Exploration and Mining Group
- Microsoft Research
- SIGMOD 2004
 
Slide 2: Exploratory Queries
Database systems are increasingly used for interactive, exploratory retrieval, e.g.:
- Home search applications
- Job search applications
- Keyword search over knowledge bases
- Product catalog search
Slide 3: Information Overload
- In exploratory retrieval, queries can return too many answers
- Only a tiny fraction of the retrieved items is relevant to the user → finding them wastes time and effort
- Causes:
  - The user is not certain of what she is looking for (e.g., looking for medium-priced homes in the Seattle area but unable to specify the exact price/neighborhood)
  - The user is naïve and refrains from using advanced search features
  - Browsing instead of searching
- Manual query reformulation is difficult and time-consuming
Slide 4: Reducing Information Overload
- Internet text search:
  - Categorization: group search results into labeled categories → the user explores just the relevant categories and ignores the rest
  - Ranking: place the most relevant answers at the top → the user examines the first 10-20 results and ignores the rest
- Database systems:
  - Ranking: has received some attention recently (in the last 3-4 years)
  - Categorization: the focus of this paper
 
Slide 5: Problems with Existing Techniques
- The category structure is created a priori (typically a manual process)
- Items are tagged (assigned to categories) a priori (manually or semi-automatically)
- At search time, each search result is placed under its pre-assigned category
- Susceptible to skew → defeats the purpose of categorization
Slide 6: Our Solution
- Categorize the results of SQL queries automatically
- Generate a labeled, hierarchical category structure dynamically, based on the contents of the tuples in the result set
- Does not suffer from the problems of a-priori categorization
Slide 7: Outline of Talk
- Exploration/cost models to quantify the information overload faced by a user during an exploration
- Heuristic algorithms to find good (low-cost) categorizations
- Experiments to evaluate the models and algorithms
 
Slide 8: Hierarchical Categorization
Recursive partitioning of the tuples in the result set, based on data attributes and their values.
Example query: find homes in the Seattle/Bellevue area of Washington, USA in the 200K - 300K price range.
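For concreteness, here is a minimal sketch of what such a dynamically generated tree might look like for this query; the attribute values and tuple counts are illustrative, not taken from the paper.

```python
# Hypothetical two-level category tree for the example homes query.
# Each leaf holds the number of matching homes (counts are made up).
category_tree = {
    "Neighborhood = Seattle": {
        "Price: 200K - 250K": 412,
        "Price: 250K - 300K": 288,
    },
    "Neighborhood = Bellevue": {
        "Price: 200K - 250K": 153,
        "Price: 250K - 300K": 97,
    },
}
```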
Slide 9: Which Categorization Is the Best?
- There are many possible categorizations for a given result set: at each level, any distinct choice of categorizing attribute and partitioning generates a different categorization
- We want to choose the categorization that minimizes the information overload on the user
- To quantify information overload, we first need a model that captures how a user navigates the result set using a given tree
Slide 10: Exploration Model
- Defines how the user navigates the result set using a given tree T
- Two scenarios:
  - ALL scenario: the user explores the result set using the category tree until she finds every tuple relevant to her
  - ONE/FEW scenario: the user explores the result set until she finds one (or a few) relevant tuples
- Given that the user has decided to explore a category C, how does she explore it? Two options:
  - SHOWTUPLES: browse through all tuples in tset(C)
  - SHOWCAT: examine the labels of all subcategories C1, ..., Cn, exploring the ones relevant to her and ignoring the rest (for each Ci, she either ignores Ci or explores Ci)
Slide 11: An Example Exploration
[Figure: the entire category tree, and one exploration on it: starting from ALL (the root) the user chooses SHOWCAT, ignores the irrelevant subcategories, chooses SHOWCAT on a relevant one, and chooses SHOWTUPLES on one of its subcategories]
Slide 12: Cost of an Exploration
The information overload cost (or simply cost) Cost(X,T) of a given user exploration X using a given category tree T is the total number of items (both category labels and data tuples) examined by the user during X.
[Figure: the same example exploration as on the previous slide]
Say |tset(C)| of the category explored via SHOWTUPLES is 20; then Cost(X,T) = 1 + 3 + 3 + 20 = 27.
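A minimal sketch of this counting; the Node class, the encoding of the user's choices, and the example tree shape are illustrative, not from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    tset_size: int = 0            # |tset(C)|: tuples under this category
    children: list = field(default_factory=list)

def exploration_cost(node, choices):
    """Items examined below `node` in one exploration X.
    `choices` maps node name -> 'SHOWTUPLES' or 'SHOWCAT'; an absent
    entry means IGNORE (the node's own label was already counted by
    its parent's SHOWCAT step)."""
    action = choices.get(node.name)
    if action == "SHOWTUPLES":
        return node.tset_size                     # browse all tuples
    if action == "SHOWCAT":
        return len(node.children) + sum(          # examine n labels, recurse
            exploration_cost(c, choices) for c in node.children)
    return 0                                      # IGNORE

# Reproducing the slide's arithmetic: root label (1) + 3 labels
# + 3 labels + 20 tuples = 27 items examined.
leaf = Node("C", tset_size=20)
mid = Node("B", children=[leaf, Node("B1"), Node("B2")])
root = Node("ALL", children=[mid, Node("A1"), Node("A2")])
X = {"ALL": "SHOWCAT", "B": "SHOWCAT", "C": "SHOWTUPLES"}
print(1 + exploration_cost(root, X))              # -> 27
```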
Slide 13: Average Cost
- If we knew the user's exploration X in advance, we could generate the tree T that minimizes the cost Cost(X,T) for that exploration
- Since we do not know it, we use aggregate knowledge of previous user behavior to estimate the cost Cost(T) that a user will face, on average, during an exploration using a given category tree T
- Cost(T) is the number of items (category labels and data tuples) that a user will need to examine, on average, while exploring the result set using T
- Categorization problem: find the tree Tmin = argmin_T Cost(T)
Slide 14: Computing the Average Cost
- We need to know two probabilities associated with each node C:
  - Exploration probability P(C) = P(user explores C | user examines the label of C)
  - SHOWTUPLES probability Pw(C) = P(user chooses SHOWTUPLES for C | user explores C)
- Consider a non-leaf node C of T with subcategories C1, ..., Cn, and let Cost(C) denote the cost of exploring the subtree rooted at C given that the user has chosen to explore C:
  Cost(C) = P(SHOWTUPLES) * SHOWTUPLES-cost + P(SHOWCAT) * SHOWCAT-cost
          = Pw(C) * |tset(C)| + (1 - Pw(C)) * (K*n + Σ_i P(Ci) * Cost(Ci))
  where K is the cost of examining one subcategory label
- Cost(T) is simply Cost(root)
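A direct transcription of this recursion into Python, reusing the Node shape from the earlier sketch and adding the two per-node probabilities; the field names and the K default are my choices.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    tset_size: int = 0
    p_explore: float = 1.0   # P(C): P(explore C | examined label of C)
    pw: float = 0.0          # Pw(C): P(SHOWTUPLES | explore C)
    children: list = field(default_factory=list)

def avg_cost(node, K=1.0):
    """Cost(C) from slide 14:
    Pw(C)*|tset(C)| + (1 - Pw(C)) * (K*n + sum_i P(Ci)*Cost(Ci))."""
    if not node.children:                 # leaf: only SHOWTUPLES is possible
        return node.tset_size
    showcat = K * len(node.children) + sum(
        child.p_explore * avg_cost(child, K) for child in node.children)
    return node.pw * node.tset_size + (1 - node.pw) * showcat

# Cost(T) is simply avg_cost(root).
```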
 
Slide 15: Estimating the Probabilities
- We use aggregate knowledge of previous user behavior
- Specifically, we look at the log of queries that users of this application have asked in the past (easy to obtain)
Slide 16: Estimating the SHOWTUPLES Probability
- SHOWTUPLES probability of C = P(user chooses SHOWTUPLES for C | user explores C)
- Given that the user explores C, she has two choices:
  - SHOWCAT: chosen if the subcategorizing attribute SA(C) is such that she is interested in only a few subcategories (i.e., few values of SA(C))
  - SHOWTUPLES: chosen if she is interested in all or most of the subcategories (i.e., all or most values of SA(C))
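The slides do not spell out the estimator, so the following is only one plausible reading of this criterion: count a workload query as a SHOWTUPLES "vote" when its condition on SA(C) covers most or all of that attribute's values (or is absent altogether). The set-based representation and the 0.5 threshold are my assumptions, not the paper's formula.

```python
def estimate_pw(category_values, workload_selections, threshold=0.5):
    """Plausible Pw(C) estimator (an assumption, not the paper's exact one).
    `category_values`: set of values of the subcategorizing attribute SA(C).
    `workload_selections`: for each past query, the set of SA(C) values its
    selection condition asks for, or None if it has no condition on SA(C).
    A query votes SHOWTUPLES when it wants most/all values of SA(C)."""
    votes = 0
    for sel in workload_selections:
        if sel is None or len(sel & category_values) / len(category_values) > threshold:
            votes += 1
    return votes / len(workload_selections) if workload_selections else 0.0
```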
Slide 17: Estimating the Exploration Probability
- Exploration probability P(C) = P(user explores C | user examines the label of C)
Slide 18: Estimating the Exploration Probability (cont.)
- Assume the user's interest in different attributes is mutually independent
- Each workload query Wi represents the information need of a user Ui
- User Ui is interested in the label "Price: 200K-225K" iff Wi contains a selection condition on Price and that condition overlaps with the label predicate 200K-225K
- P(C) = (number of workload queries interested in the label of C) / (total number of workload queries)
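Following the slide's definition literally, P(C) can be estimated as the fraction of workload queries whose selection condition overlaps the label predicate of C; a sketch for range labels, where the (lo, hi) interval representation is my choice.

```python
def overlaps(a, b):
    """Do two closed numeric ranges (lo, hi) intersect?"""
    return a[0] <= b[1] and b[0] <= a[1]

def estimate_p_explore(label_range, workload_ranges):
    """P(C) = fraction of workload queries whose selection condition on the
    label's attribute overlaps the label predicate (e.g. Price: 200K-225K).
    `workload_ranges`: one (lo, hi) per query, or None if the query has no
    condition on this attribute (such a user is not interested in C)."""
    hits = sum(1 for r in workload_ranges
               if r is not None and overlaps(r, label_range))
    return hits / len(workload_ranges) if workload_ranges else 0.0

# e.g. estimate_p_explore((200_000, 225_000), price_conditions_from_log)
```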
 
Slide 19: Outline of Talk
- Exploration/cost models to quantify the information overload faced by a user during an exploration
- Heuristic algorithms to find good (low-cost) categorizations
- Experiments to evaluate the models and algorithms
 
Slide 20: Categorization Algorithm
- Categorization problem: find the tree Tmin = argmin_T Cost(T)
- An enumerative algorithm would produce the optimal tree but is prohibitively expensive
- Heuristic techniques to reduce the search space (1-level case first):
  - Eliminate unattractive attributes without considering any of their partitionings
  - For each attribute retained, obtain a good partitioning efficiently (without enumerating all possible partitionings)
  - Choose the attribute-partitioning combination with the least cost
Slide 21: Attribute Elimination
- Attributes that occur infrequently in selection conditions in the workload queries can be discarded right away:
  - Categorizing attribute occurs rarely → SHOWTUPLES probability is high
  - Typically, SHOWTUPLES cost >> SHOWCAT cost
  - So a rarely occurring categorizing attribute → high cost, irrespective of the partitioning
- Eliminate attributes whose occurrence fraction is < x (a sketch follows)
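A minimal sketch of the elimination pass, assuming the workload is available as a list of per-query attribute sets; the cutoff value 0.1 is illustrative.

```python
from collections import Counter

def eliminate_attributes(workload_attr_sets, attributes, x=0.1):
    """Keep only attributes whose occurrence fraction in workload selection
    conditions is >= x; rarely selected attributes imply a high SHOWTUPLES
    probability and hence a high cost, whatever the partitioning.
    `workload_attr_sets`: for each past query, the set of attributes that
    appear in its selection conditions."""
    counts = Counter(a for attrs in workload_attr_sets for a in attrs)
    n = len(workload_attr_sets)
    return [a for a in attributes if counts[a] / n >= x]
```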
 
Slide 22: Partitioning for Categorical Attributes
- We consider only single-value partitionings, so we only need to find the right ordering of the categories
- Among all orderings, the cost is minimized when the categories are presented in increasing order of Cost(Ci) / P(Ci)
- Intuition:
  - If the user is likely to drill into a category, present it earlier
  - If a category's exploration cost is low, place it earlier
- Cost(Ci) is complex to compute in the multilevel case, so we use a simple heuristic: present the categories in decreasing order of P(Ci) (sketched below)
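The heuristic reduces to a single sort; a sketch assuming P(Ci) has already been estimated per category value (as on slide 18).

```python
def order_categories(categories, p_explore):
    """Single-value partitioning of a categorical attribute: one category per
    value, presented in decreasing order of exploration probability P(Ci),
    so the categories a user is most likely to drill into come first.
    `p_explore`: dict mapping category value -> estimated P(Ci)."""
    return sorted(categories, key=lambda c: p_explore.get(c, 0.0), reverse=True)

# order_categories(["Seattle", "Bellevue", "Redmond"],
#                  {"Seattle": 0.5, "Bellevue": 0.3, "Redmond": 0.2})
```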
Slide 23: Partitioning for Numeric Attributes
- Consider the simple case where we want to partition the range (vmin, vmax] into two buckets: we must identify the best point at which to split
- If many query ranges begin or end at v, then v is a good point to split
- Goodness score of a split point v: goodness(v) = start(v) + end(v), i.e., the number of workload query ranges that start at v plus the number that end at v
- Heuristic: to create m buckets, choose the (m-1) split points with the best goodness scores (sketched below)
- The above algorithm produces the cost-optimal partitioning in some special cases, and good partitionings in general
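A sketch of the split-point heuristic, assuming the workload's conditions on the attribute are given as (start, end) ranges.

```python
from collections import Counter

def choose_splitpoints(query_ranges, m):
    """Pick the (m-1) split points with the highest goodness scores, where
    goodness(v) = number of workload query ranges that start at v plus the
    number that end at v (slide 23). Returns the chosen points in value order."""
    goodness = Counter()
    for lo, hi in query_ranges:
        goodness[lo] += 1     # a range starting at v makes v a good boundary
        goodness[hi] += 1     # so does a range ending at v
    best = sorted(goodness, key=goodness.get, reverse=True)[:m - 1]
    return sorted(best)

# choose_splitpoints([(200, 250), (250, 300), (225, 275)], m=3) -> [200, 250]
```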
Slide 24: Multilevel Categorization
- At each level l:
  - Determine the categorizing attribute A
  - For each category C at level (l-1) with |tset(C)| > M, partition C using A
- Detailed algorithm in the paper; a skeleton follows
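The full algorithm is in the paper; the sketch below shows only the level-by-level recursion, assuming tuples are dicts and that a per-level categorizing attribute and a value-to-bucket function have already been chosen by the 1-level machinery above.

```python
def categorize(tuples, levels, M=20):
    """Recursively partition `tuples` level by level (slide 24 skeleton).
    `levels`: list of (attribute, bucket_fn) pairs, one per tree level, where
    bucket_fn maps an attribute value to a category label. A category with
    |tset(C)| <= M is left as a leaf instead of being partitioned further.
    Choosing the attribute and the buckets per level is the job of the 1-level
    algorithm sketched earlier; here they are given."""
    if not levels or len(tuples) <= M:
        return tuples                              # leaf: just the tuple set
    (attr, bucket_fn), rest = levels[0], levels[1:]
    buckets = {}
    for t in tuples:
        buckets.setdefault(bucket_fn(t[attr]), []).append(t)
    return {label: categorize(ts, rest, M) for label, ts in buckets.items()}

# e.g. categorize(homes, [("neighborhood", str),
#                         ("price", lambda p: f"{(p // 50000) * 50}K bucket")])
```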
 
Slide 25: Outline of Talk
- Exploration/cost models to quantify the information overload faced by a user during an exploration
- Heuristic algorithms to find good (low-cost) categorizations
- Experiments to evaluate the models and algorithms
 
Slide 26: Experiments
- Goals:
  - Evaluate the accuracy of our cost models
  - Evaluate our cost-based categorization algorithm and compare it with techniques that do not consider such cost models
- Dataset:
  - Home listing database (1.7 million homes) from MSN House&Home
  - Attributes: neighborhood, price, bedroom-count, bath-count, year-built, property-type, square-footage
- Workload:
  - 176,262 SQL query strings representing searches conducted by real buyers on the MSN House&Home web site
- Two studies:
  - Large-scale, simulated, cross-validated user study
  - Real-life user study
  - In both, we compare our cost-based algorithm to the No-Cost and Attr-Cost techniques
Slide 27: Large-scale Simulated User Study
- Treat a workload query as a user exploration
- Broaden the exploration to generate the user query for which the tree is generated
- 8-fold cross-validation (each subset has 100 queries)
- Strong positive correlation between the actual cost and the estimated average cost
- Cost-based techniques are about 5 times better than the other techniques
[Chart: cost of the various techniques]
Slide 28: Real-life User Study
- 11 users (FTEs and summer interns of our group, August 2003)
- 4 search tasks, 3 categorization techniques, cross-balanced across users (each user did a task only once, using one of the categorization techniques)
- Users need to examine 5-10 items per relevant tuple found
[Chart: average normalized cost (items examined by the user per relevant tuple found) of the various techniques]
Slide 29: Summary
- Proposed automatically generating a labeled, hierarchical category structure to reduce information overload
- Developed analytical models to quantify information overload
  - They accurately model the information overload users face in real life (90% correlation)
- Based on those models, developed algorithms to generate the tree that minimizes information overload
  - Our cost-based technique outperforms the other techniques by 1-2 orders of magnitude, significantly reducing information overload
Thank you!