Title: Automatic Categorization of Query Results
1. Automatic Categorization of Query Results
- Kaushik Chakrabarti, Surajit Chaudhuri, Seung-won Hwang
- Data Management, Exploration and Mining Group
- Microsoft Research
- SIGMOD 2004
2. Exploratory Queries
Database systems are increasingly used for interactive, exploratory retrieval:
Home Searching Application
Job Searching Application
Keyword Search over Knowledge Bases
Product Catalog Searching
3. Information Overload
- In exploratory retrieval, queries can return too many answers; only a tiny fraction of the retrieved items is relevant to the user, and finding those items wastes time and effort
- Causes:
- The user is not certain of what she is looking for (e.g., looking for medium-priced homes in the Seattle area but unable to specify an exact price or neighborhood)
- The user is naive and refrains from using advanced search features, browsing instead of searching
- Manual query reformulation is difficult and time-consuming
4. Reducing Information Overload
- Internet text search:
- Categorization: group search results into labeled categories; the user explores just the relevant categories and ignores the rest
- Ranking: place the most relevant answers at the top; the user examines the first 10-20 results and ignores the rest
- Database systems:
- Ranking: has received some attention recently (in the last 3-4 years)
- Categorization: the focus of this paper
5. Problems with Existing Techniques
- The category structure is created a priori (typically a manual process)
- Items are tagged (assigned categories) a priori (manually or semi-automatically)
- At search time, each search result is placed under its pre-assigned category
- Susceptible to skew, which defeats the purpose of categorization
6. Our Solution
- Categorize the results of SQL queries automatically
- Generate a labeled, hierarchical category structure dynamically, based on the contents of the tuples in the result set
- Does not suffer from the problems of a-priori categorization
7. Outline of Talk
- Exploration/cost models to quantify the information overload faced by a user during an exploration
- Heuristic algorithms to find good (low-cost) categorizations
- Experiments to evaluate the models and algorithms
8. Hierarchical Categorization
Recursive partitioning of the tuples in the result set, based on data attributes and their values.
Example query: find homes in the Seattle/Bellevue area of Washington, USA in the 200K-300K price range.
9. Which Categorization Is the Best?
- There are many possible categorizations for a given result set: at each level, any distinct choice of categorizing attribute and partitioning generates a different categorization
- We want to choose the categorization that minimizes the information overload on the user
- To quantify information overload, we first need a model that captures how a user navigates the result set using a given tree
10. Exploration Model
- Defines how the user navigates the result set using a given tree T
- Two scenarios:
- ALL scenario: the user explores the result set using the category tree until she finds every tuple relevant to her
- ONE/FEW scenario: the user explores the result set until she finds one (or a few) relevant tuples
Given that the user has decided to explore a category C, how does she explore C?
- Option SHOWTUPLES: browse through all tuples in tset(C)
- Option SHOWCAT: examine the labels of all subcategories C1, ..., Cn, and for each Ci either explore it or ignore it
11. An Example Exploration
(Figure: the entire category tree, and one exploration on that tree; starting from ALL at the root, each node is marked with the user's choice: IGNORE, SHOWCAT, or SHOWTUPLES)
12. Cost of an Exploration
Information Overload Cost (or simply Cost): Cost(X,T) of a given user exploration X using a given category tree T is the total number of items (both category labels and data tuples) examined by the user during X.
(Figure: the example exploration, with each node marked IGNORE, SHOWCAT, or SHOWTUPLES, starting from ALL at the root)
Say tset(C) for the SHOWTUPLES category contains 20 tuples; then Cost(X,T) = 1 + 3 + 3 + 20 = 27.
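The cost definition above can be sketched in code. This is an illustrative sketch, not the paper's implementation; the `Node` class and field names are assumptions. It walks a category tree with a fixed set of per-node choices and counts the labels and tuples the user examines:

```python
# Hypothetical tree representation: each category knows its subcategories and
# |tset(C)|, the number of tuples it covers.
class Node:
    def __init__(self, label, children=(), tset_size=0):
        self.label = label
        self.children = list(children)
        self.tset_size = tset_size

def exploration_cost(node, choice):
    """Items examined while exploring `node` (whose label is already read).

    `choice` maps each label to 'IGNORE', 'SHOWTUPLES', or 'SHOWCAT'.
    """
    if choice[node.label] == "SHOWTUPLES":
        return node.tset_size                  # read every tuple in tset(C)
    cost = len(node.children)                  # examine all subcategory labels
    for c in node.children:
        if choice[c.label] != "IGNORE":
            cost += exploration_cost(c, choice)
    return cost

# A tree shaped like the slide's example: SHOWCAT at the root and at one
# child, SHOWTUPLES at one grandchild with |tset(C)| = 20, IGNORE elsewhere.
root = Node("ALL", [
    Node("A", [Node("A1"), Node("A2", tset_size=20), Node("A3")]),
    Node("B"), Node("C")])
choice = {"ALL": "SHOWCAT", "A": "SHOWCAT", "B": "IGNORE", "C": "IGNORE",
          "A1": "IGNORE", "A2": "SHOWTUPLES", "A3": "IGNORE"}
total = 1 + exploration_cost(root, choice)     # +1 for the root's own label
print(total)                                   # 1 + 3 + 3 + 20 = 27
```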
13. Average Cost
- If we knew the user's exploration X in advance, we could generate the tree T that minimizes the cost Cost(X,T) for that exploration
- Since we do not, we use aggregate knowledge of previous user behavior to estimate the cost Cost(T) that a user will face, on average, during an exploration using a given category tree T
- Cost(T) is the number of items (category labels and data tuples) that a user needs to examine, on average, during an exploration of the result set using T
- Categorization problem: find the tree Tmin = argmin_T Cost(T)
14. Computing the Average Cost
- We need two probabilities associated with each node C:
- Exploration probability P(C) = P(user explores C | user examines the label of C)
- SHOWTUPLES probability Pw(C) = P(user chooses SHOWTUPLES for C | user explores C)
- Consider a non-leaf node C of T with subcategories C1, ..., Cn
- Let Cost(C) denote the cost of exploring the subtree rooted at C, given that the user has chosen to explore C
- Cost(C) = P(SHOWTUPLES) * (SHOWTUPLES cost) + P(SHOWCAT) * (SHOWCAT cost)
  = Pw(C) * |tset(C)| + (1 - Pw(C)) * (K*n + Σi P(Ci) * Cost(Ci)), where K is the cost of examining one category label
- Cost(T) is simply Cost(root)
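A direct transcription of this recurrence, as a sketch: the `Cat` class, its field names, the example probabilities, and K = 1 are all assumptions for illustration.

```python
from dataclasses import dataclass, field

K = 1  # assumed cost of examining a single category label

@dataclass
class Cat:
    tset_size: int                  # |tset(C)|
    pw: float = 1.0                 # Pw(C): P(SHOWTUPLES | user explores C)
    p_explore: float = 0.0          # P(C): P(explore C | label of C examined)
    children: list = field(default_factory=list)

def avg_cost(c):
    """Cost(C) = Pw(C)*|tset(C)| + (1-Pw(C))*(K*n + sum_i P(Ci)*Cost(Ci))."""
    if not c.children:              # leaf: the only option is SHOWTUPLES
        return c.tset_size
    showcat = K * len(c.children) + sum(
        ch.p_explore * avg_cost(ch) for ch in c.children)
    return c.pw * c.tset_size + (1 - c.pw) * showcat

# Made-up two-category tree; Cost(T) is Cost(root).
root = Cat(tset_size=30, pw=0.2,
           children=[Cat(20, p_explore=0.5), Cat(10, p_explore=0.1)])
cost_T = avg_cost(root)  # = 0.2*30 + 0.8*(2 + 0.5*20 + 0.1*10) = 16.4
```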
15. Estimating the Probabilities
- We use aggregate knowledge of previous user behavior
- Specifically, we look at the log of queries (the workload) that users of this application have asked in the past, which is easy to obtain
16. Estimating the SHOWTUPLES Probability
SHOWTUPLES probability of C: Pw(C) = P(user chooses SHOWTUPLES for C | user explores C).
Given that the user explores C, she has two choices:
- SHOWCAT: chosen if the subcategorizing attribute SA(C) is such that she is interested in only a few subcategories (i.e., few values of SA(C))
- SHOWTUPLES: chosen if she is interested in all or most of the subcategories (i.e., all or most values of SA(C))
17. Estimating the Exploration Probability
- Exploration probability P(C) = P(user explores C | user examines the label of C)
18. Estimating the Exploration Probability (contd.)
- Assume users' interests in different attributes are mutually independent
- A workload query Wi represents the information need of its user Ui
- User Ui is interested in a label such as "Price: 200K-225K" iff Wi contains a selection condition on Price and that condition overlaps with the label predicate 200K-225K
- P(C) is then estimated from the fraction of workload queries whose users are interested in the label of C
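The overlap test can be sketched as follows. This is an assumed representation, not the paper's code: each workload query is a dict mapping an attribute to the (lo, hi) range of its selection condition.

```python
def overlaps(cond, label):
    """Two ranges (lo, hi) overlap iff their intersection is non-empty."""
    return cond[0] < label[1] and label[0] < cond[1]

def explore_probability(attr, label_range, workload):
    """Fraction of workload queries interested in the label (attr, range)."""
    interested = sum(1 for q in workload
                     if attr in q and overlaps(q[attr], label_range))
    return interested / len(workload)

# Three hypothetical buyers' queries; only the first has a Price condition
# overlapping the label predicate 200K-225K.
workload = [{"Price": (210_000, 250_000)},
            {"Price": (150_000, 190_000)},
            {"Bedrooms": (3, 4)}]
p = explore_probability("Price", (200_000, 225_000), workload)
print(round(p, 3))  # 0.333
```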
19. Outline of Talk
- Exploration/cost models to quantify the information overload faced by a user during an exploration
- Heuristic algorithms to find good (low-cost) categorizations
- Experiments to evaluate the models and algorithms
20. Categorization Algorithm
- Categorization problem: find the tree Tmin = argmin_T Cost(T)
- An enumerative algorithm would produce the optimal tree but is prohibitively expensive
- Heuristic techniques to reduce the search space (1-level case first):
- Eliminate unattractive attributes without considering any of their partitionings
- For each attribute retained, obtain a good partitioning efficiently (without enumerating all possible partitionings)
- Choose the attribute-partitioning combination with the least cost
21. Attribute Elimination
- Attributes that occur infrequently in selection conditions in the workload queries can be discarded right away:
- If the categorizing attribute is low-occurring, the SHOWTUPLES probability is high
- Typically, SHOWTUPLES cost >> SHOWCAT cost
- So a low-occurring categorizing attribute means high cost, irrespective of the partitioning
- Eliminate attributes whose occurrence fraction is < x (a threshold)
22. Partitioning for Categorical Attributes
- We consider only single-value partitionings
- So we only need to find the right ordering of the categories
- Among all orderings, cost is minimized when the categories are presented in increasing order of Cost(Ci)/P(Ci)
- Intuition:
- If the user is likely to drill into a category, present it earlier
- If a category's exploration cost is low, place it earlier
- Cost(Ci) is complex to compute in the multilevel case
- So we use the following simple heuristic: present the categories in decreasing order of P(Ci)
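Both orderings from this slide can be shown side by side; the P(Ci) and Cost(Ci) values below are invented for illustration.

```python
# The cost-optimal rule sorts categories by increasing Cost(Ci)/P(Ci);
# the multilevel heuristic simply sorts by decreasing P(Ci).
cats = {"Seattle":  {"p": 0.6, "cost": 30},
        "Bellevue": {"p": 0.3, "cost": 6},
        "Redmond":  {"p": 0.1, "cost": 4}}

optimal = sorted(cats, key=lambda c: cats[c]["cost"] / cats[c]["p"])
heuristic = sorted(cats, key=lambda c: cats[c]["p"], reverse=True)

print(optimal)    # ['Bellevue', 'Redmond', 'Seattle']  (ratios 20 < 40 < 50)
print(heuristic)  # ['Seattle', 'Bellevue', 'Redmond']
```

Note that the two rules can disagree: Seattle has the highest drill-in probability but also the highest exploration cost, so the optimal ordering pushes it last while the heuristic puts it first.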
23. Partitioning for Numeric Attributes
- Consider the simple case where we want to partition the range (vmin, vmax] into two buckets
- We want to identify the best point at which to split
- If many query ranges begin or end at v, then v is a good point to split
- Goodness score of a split point v = (number of workload ranges starting at v) + (number of workload ranges ending at v)
- Heuristic: choose the (m-1) best split points (based on goodness scores) needed for an m-bucket partitioning
- The above algorithm produces the cost-optimal partitioning in some special cases, and good partitionings in general
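The goodness-score heuristic can be sketched as below; the workload ranges are made-up values for illustration.

```python
from collections import Counter

def best_splitpoints(ranges, m):
    """Pick the (m-1) highest-scoring split points for an m-bucket partition."""
    score = Counter()
    for start, end in ranges:
        score[start] += 1       # a range beginning at v makes v a good split
        score[end] += 1         # so does a range ending at v
    return sorted(v for v, _ in score.most_common(m - 1))

# Four hypothetical price ranges from the workload (in thousands): 200 scores
# 3 (it starts three ranges), making it the strongest split point.
ranges = [(200, 250), (200, 300), (250, 300), (200, 400)]
pts = best_splitpoints(ranges, m=3)   # two split points -> three buckets
print(pts)
```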
24. Multilevel Categorization
- At each level l:
- Determine the categorizing attribute A
- For each category C at level (l-1) with |tset(C)| > M, partition C using A
- Detailed algorithm in the paper
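The level-by-level construction can be sketched with stubbed choices: here the categorizing attribute per level is simply taken in a fixed order and categorical single-value partitioning is used, whereas the paper chooses both with the cost model; M and the sample data are assumptions.

```python
M = 2  # assumed threshold: categories with at most M tuples are not subdivided

def build_tree(tuples, attributes):
    if len(tuples) <= M or not attributes:
        return {"tuples": tuples}                # leaf: shown via SHOWTUPLES
    attr, rest = attributes[0], attributes[1:]   # stub for attribute selection
    groups = {}
    for t in tuples:
        groups.setdefault(t[attr], []).append(t)
    return {attr: {value: build_tree(group, rest)
                   for value, group in groups.items()}}

homes = [{"City": "Seattle", "Beds": 2}, {"City": "Seattle", "Beds": 3},
         {"City": "Seattle", "Beds": 3}, {"City": "Bellevue", "Beds": 2}]
tree = build_tree(homes, ["City", "Beds"])
# Seattle (3 tuples > M) is subdivided by Beds; Bellevue (1 tuple) is a leaf.
```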
25. Outline of Talk
- Exploration/cost models to quantify the information overload faced by a user during an exploration
- Heuristic algorithms to find good (low-cost) categorizations
- Experiments to evaluate the models and algorithms
26. Experiments
- Goals:
- Evaluate the accuracy of our cost models
- Evaluate our cost-based categorization algorithm and compare it with techniques that do not consider such cost models
- Dataset:
- Home listing database (1.7 million homes) from MSN House&Home
- Attributes: neighborhood, price, bedroom-count, bath-count, year-built, property-type, square-footage
- Workload:
- 176,262 SQL query strings representing searches conducted by real buyers on the MSN House&Home web site
- Two studies:
- A large-scale, simulated, cross-validated user study
- A real-life user study
- Both compare our cost-based algorithm to the No-Cost and Attr-Cost techniques
27. Large-scale Simulated User Study
- Treat a workload query as a user exploration
- Broaden the exploration to generate the user query for which the tree is generated
- 8-fold cross-validation (each subset has 100 queries)
- Results: a strong positive correlation between the actual cost and the estimated average cost
- The cost-based techniques are about 5 times better than the other techniques
(Figure: cost of the various techniques)
28. Real-life User Study
- 11 users (FTEs and summer interns of our group, August 2003)
- 4 search tasks, 3 categorization techniques, cross-balanced across users (each user did a task only once, using one of the categorization techniques)
- Users need to examine 5-10 items per relevant tuple found
(Figure: average normalized cost, i.e., items examined by the user per relevant tuple found, of the various techniques)
29. Summary
- Proposed automatically generating a labeled, hierarchical category structure to reduce information overload
- Developed analytical models to quantify information overload
- The models accurately capture information overload in real life (90% correlation)
- Based on those models, developed algorithms to generate the tree that minimizes information overload
- Our cost-based technique outperforms the other techniques by 1-2 orders of magnitude, significantly reducing the problem of information overload

Thank you!