Automatic Categorization of Query Results
  • Kaushik Chakrabarti, Surajit Chaudhuri,
    Seung-won Hwang
  • Data Management Exploration and Mining Group
  • Microsoft Research
  • SIGMOD 2004

Exploratory Queries
Database systems are increasingly used for
interactive, exploratory retrieval, for example:
  • Home searching applications
  • Job searching applications
  • Keyword search over knowledge bases
  • Product catalog searching

Information Overload
  • In exploratory retrieval, queries can return
    too many answers
  • Only a tiny fraction of the retrieved items is
    relevant to the user, so finding them wastes
    the user's time
  • Causes:
  • The user is not certain of what she is looking
    for (e.g., looking for medium-priced homes in
    the Seattle area but unable to specify an exact
    price range)
  • The user is naïve and refrains from using
    advanced search features
  • Browsing instead of searching
  • Manual query reformulation is difficult

Reducing Information Overload
  • Internet text search:
  • Categorization: group search results into
    labeled categories; the user explores just the
    relevant categories and ignores the rest
  • Ranking: place the most relevant answers at the
    top; the user examines the first 10-20 results
    and ignores the rest
  • Database systems:
  • Ranking: has received some attention recently
    (last 3-4 years)
  • Categorization: the focus of this paper

Problems with Existing Techniques
  • Category structure created a-priori (typically a
    manual process)
  • Items tagged (assigned categories) a-priori
  • At search time, each search result is placed
    under its pre-assigned category
  • Susceptible to skew, which defeats the purpose
    of categorization

Our solution
  • Categorize results of SQL queries automatically
  • Generate labeled, hierarchical category structure
    dynamically based on the contents of the tuples
    in the result set
  • Does not suffer from the problems of a-priori
    categorization

Outline of talk
  • Exploration/Cost Models to quantify the
    information overload faced by a user during an
    exploration
  • Heuristic Algorithms to find good (low-cost)
    categorizations
  • Experiments to evaluate models/algorithms

Hierarchical Categorization
Recursive partitioning of the tuples in the result
set based on data attributes and their values
Example query: find homes in the Seattle/Bellevue
area of Washington, USA, in the 200K-300K price
range

Which categorization is the best?
  • Many categorizations are possible for a given
    result set: at each level, any distinct choice
    of categorizing attribute and partitioning
    generates a different categorization
  • We want to choose the categorization that
    minimizes the information overload on the user
  • In order to quantify information overload, we
    first need a model that captures how a user
    navigates the result set using a given tree

Exploration Model
  • Defines how the user navigates the result set
    using a given tree T
  • Two scenarios:
  • ALL scenario: the user explores the result set
    using the category tree until she finds every
    tuple relevant to her
  • ONE/FEW scenario: the user explores the result
    set until she finds one or a few relevant tuples

Given that the user has decided to explore C, how
does she explore C?
  • Option SHOWTUPLES: browse through all tuples in
    tset(C)
  • Option SHOWCAT: examine the labels of all
    subcategories C1, ..., Cn of C, exploring the
    ones relevant to her and ignoring the rest
    (for each Ci: ignore Ci or explore Ci)

An Example Exploration
[Figure: the entire category tree, rooted at ALL,
and one exploration on the tree]

Cost of an Exploration
Information Overload Cost (or simply Cost):
Cost(X,T) of a given user exploration X using a
given category tree T is the total number of
items (which includes both category labels and
data tuples) examined by the user during X.
Example (exploration starting at the ALL root): say
|tset(C)| = 20 for the category whose tuples the
user browses; then
Cost(X,T) = 1 + 3 + 3 + 20 = 27

Average Cost
  • If we knew the user's exploration X in advance,
    we could generate the tree T that minimizes the
    cost Cost(X,T) for that exploration
  • Since we do not know it, we use aggregate
    knowledge of previous user behavior to estimate
    the cost Cost(T) that a user will face, on
    average, during an exploration using a given
    category tree T
  • Cost(T) is the number of items (which includes
    category labels and data tuples) that a user will
    need to examine, on average, during the
    exploration of the result set using T
  • Categorization problem: find the tree
    Tmin = argmin_T Cost(T)

Computing the Average Cost
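  • We need to know two probabilities associated with
    each node C
  • Exploration probability: P(C) = P(User explores C
    | User examines label of C)
  • SHOWTUPLES probability: Pw(C) = P(User chooses
    SHOWTUPLES for C | User explores C)
  • Consider a non-leaf node C of T with
    subcategories C1, ..., Cn
  • Let Cost(C) denote the cost of exploring the
    subtree rooted at C, given that the user has
    chosen to explore C
  • Cost(C) = Pw(C) * |tset(C)|
              + (1 - Pw(C)) * (K*n + SUM_{i=1..n} P(Ci) * Cost(Ci))
    The first term is SHOWTUPLES probability times
    SHOWTUPLES cost; the second is SHOWCAT
    probability times SHOWCAT cost. K is the cost of
    examining a category label relative to the cost
    of examining a data tuple
  • Cost(T) is simply Cost(root)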
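
The recursion for Cost(C) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Category` class and its field names are hypothetical, K defaults to 1, and a leaf is assumed to cost |tset(C)| because SHOWTUPLES is the only option there.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Category:
    """Hypothetical node of a category tree (names are illustrative)."""
    label: str
    tset_size: int               # |tset(C)|
    p_explore: float = 1.0       # P(C)
    p_showtuples: float = 1.0    # Pw(C)
    children: List["Category"] = field(default_factory=list)


def cost(c: Category, k: float = 1.0) -> float:
    """Average cost of exploring the subtree rooted at c."""
    if not c.children:
        # Assumption: at a leaf the user can only browse the tuples.
        return float(c.tset_size)
    n = len(c.children)
    # SHOWCAT cost: examine n labels, then explore each child with P(Ci).
    showcat_cost = k * n + sum(ch.p_explore * cost(ch, k) for ch in c.children)
    # SHOWTUPLES cost: examine every tuple in tset(C).
    return c.p_showtuples * c.tset_size + (1 - c.p_showtuples) * showcat_cost


leaf_a = Category("Neighborhood: Seattle", 60, p_explore=0.5)
leaf_b = Category("Neighborhood: Redmond, Bellevue", 40, p_explore=0.25)
root = Category("ALL", 100, p_showtuples=0.2, children=[leaf_a, leaf_b])
print(round(cost(root), 2))  # 0.2*100 + 0.8*(2 + 0.5*60 + 0.25*40) = 53.6
```

With these made-up probabilities, browsing all 100 tuples directly would cost 100, so the tree roughly halves the expected number of items examined.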

Estimating Probabilities
  • We use aggregate knowledge of previous user
    behavior
  • Specifically, we look at the log of queries that
    users of this application have asked in the past
    (easy to obtain)

Estimating SHOWTUPLES Probability
SHOWTUPLES probability of C: Pw(C) = P(User chooses
SHOWTUPLES for C | User explores C)
Given that the user explores C, she has two choices:
  • SHOWCAT: the user chooses this if the
    subcategorizing attribute SA(C) is such that she
    is interested in only a few subcategories (i.e.,
    few values of SA(C))
  • SHOWTUPLES: the user chooses this if she is
    interested in all or most of the subcategories
    (i.e., all or most values of SA(C))
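
One way to turn this into a number from a query log is sketched below. It rests on an assumption that the slide text does not spell out: a workload query with a selection condition on SA(C) is taken to mean the user cares about only a few of its values (SHOWCAT), so Pw(C) is estimated as one minus the fraction of such queries. The workload representation (one dict per query, mapping attribute name to its condition) and the function name are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical workload format: attribute name -> (low, high) condition.
Query = Dict[str, Tuple[float, float]]


def showtuples_probability(workload: List[Query], sub_attribute: str) -> float:
    """Estimate Pw(C) for a node subcategorized by sub_attribute."""
    if not workload:
        return 1.0  # assumption: with no evidence, fall back to SHOWTUPLES
    # Queries that constrain the attribute signal interest in few values.
    n_attr = sum(1 for q in workload if sub_attribute in q)
    return 1.0 - n_attr / len(workload)


workload = [
    {"price": (200, 250)},
    {"price": (210, 220), "bedrooms": (3, 4)},
    {"price": (300, 400)},
    {"bedrooms": (2, 3)},
]
print(showtuples_probability(workload, "price"))  # 3 of 4 queries use price
```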

Estimating Exploration Probability
  • Exploration probability: P(C) = P(User explores C
    | User examines label of C)
  • Users' interests in different attributes are
    assumed to be mutually independent
  • Workload query Wi represents the information
    need of user Ui
  • User Ui is interested in the label
    Price: 200K-225K iff Wi contains a selection
    condition on Price and that condition overlaps
    with the label predicate
  • P(C) is estimated from the workload using this
    overlap test
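
The overlap test can be sketched as follows. The formula itself is missing from this transcript, so the ratio used here is an assumption: among the workload queries that have a condition on the categorizing attribute, P(C) is the fraction whose condition overlaps the label predicate. The half-open interval convention, the workload format and the function names are likewise illustrative.

```python
from typing import Dict, List, Tuple

Range = Tuple[float, float]
Query = Dict[str, Range]  # hypothetical: attribute -> (low, high) condition


def overlaps(condition: Range, label: Range) -> bool:
    """True if two ranges share more than a boundary point."""
    return condition[0] < label[1] and label[0] < condition[1]


def exploration_probability(workload: List[Query], attribute: str,
                            label: Range) -> float:
    """Estimate P(C) for a category labeled `attribute: label`."""
    conditions = [q[attribute] for q in workload if attribute in q]
    if not conditions:
        return 0.0  # assumption: no query ever constrained this attribute
    hits = sum(1 for cond in conditions if overlaps(cond, label))
    return hits / len(conditions)


workload = [
    {"price": (200, 250)},
    {"price": (210, 220)},
    {"price": (300, 400)},
    {"bedrooms": (3, 4)},
]
# Two of the three price conditions overlap the label Price: 200K-225K.
p = exploration_probability(workload, "price", (200, 225))
```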

Categorization Algorithm
  • Categorization problem: find the tree
    Tmin = argmin_T Cost(T)
  • Enumerative algorithm will produce optimal tree
    but is prohibitively expensive
  • Heuristic techniques to reduce search space
    (1-level case first)
  • Eliminate unattractive attributes without
    considering any of their partitionings
  • For each attribute retained, obtain a good
    partitioning efficiently (without enumerating all
    possible partitionings)
  • Choose the attribute-partitioning combination
    with the least cost.

Attribute Elimination
  • Attributes that occur infrequently in the
    selection conditions of the workload queries can
    be discarded right away
  • Categorizing attribute occurs infrequently ⇒
    SHOWTUPLES probability is high
  • Typically, SHOWTUPLES cost >> SHOWCAT cost
  • Categorizing attribute occurs infrequently ⇒
    high cost irrespective of partitioning
  • Eliminate attributes whose occurrence fraction
    is < x
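
The elimination rule is easy to sketch. The workload format (one dict per query, keyed by the attributes that appear in its selection conditions) and the function names are assumptions made for illustration; the threshold x is whatever the application chooses.

```python
from typing import Dict, List


def occurrence_fraction(workload: List[Dict[str, object]], attribute: str) -> float:
    """Fraction of workload queries with a selection condition on attribute."""
    if not workload:
        return 0.0
    return sum(1 for q in workload if attribute in q) / len(workload)


def retained_attributes(workload: List[Dict[str, object]],
                        attributes: List[str], x: float) -> List[str]:
    """Keep only attributes whose occurrence fraction is at least x."""
    return [a for a in attributes if occurrence_fraction(workload, a) >= x]


workload = [
    {"price": (200, 250), "neighborhood": "Seattle"},
    {"price": (210, 220)},
    {"price": (300, 400), "neighborhood": "Bellevue"},
    {"bedrooms": (3, 4)},
]
candidates = ["neighborhood", "price", "bedrooms", "year_built"]
kept = retained_attributes(workload, candidates, x=0.4)
print(kept)  # ['neighborhood', 'price']
```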

Partitioning for Categorical Attributes
  • We consider only single-value partitionings
  • We only need to find the right ordering
  • Among all orderings, cost is minimum when the
    categories are presented in increasing order of
    1/P(Ci) + Cost(Ci)
  • Intuition:
  • If the user is likely to drill into a category,
    present it earlier
  • If the exploration cost is low, place it earlier
  • Cost(Ci) is complex to compute in the multilevel
    case
  • We use the following simple heuristic: present
    categories in decreasing order of P(Ci)
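
The two orderings can be compared with a small sketch. The tuples `(label, P(Ci), Cost(Ci))` and their values are made up for illustration, and every P(Ci) is assumed to be greater than zero so that 1/P(Ci) is defined.

```python
from typing import List, Tuple

# Hypothetical single-value categories: (label, P(Ci), Cost(Ci)).
Cat = Tuple[str, float, float]


def order_by_cost_rule(cats: List[Cat]) -> List[Cat]:
    """Increasing order of 1/P(Ci) + Cost(Ci); needs Cost(Ci) to be known."""
    return sorted(cats, key=lambda c: 1.0 / c[1] + c[2])


def order_by_heuristic(cats: List[Cat]) -> List[Cat]:
    """Simple heuristic: decreasing order of P(Ci) only."""
    return sorted(cats, key=lambda c: c[1], reverse=True)


cats = [("Seattle", 0.5, 30.0), ("Redmond", 0.25, 10.0), ("Bellevue", 0.8, 40.0)]
print([c[0] for c in order_by_cost_rule(cats)])   # ['Redmond', 'Seattle', 'Bellevue']
print([c[0] for c in order_by_heuristic(cats)])   # ['Bellevue', 'Seattle', 'Redmond']
```

In this made-up example the two orderings disagree, which shows what the heuristic gives up by not computing Cost(Ci).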

Partitioning for Numeric Attributes
  • Consider the simple case where we want to
    partition the range from vmin to vmax into two
    buckets
  • We want to identify the best point to split
  • If a lot of query ranges begin/end at v, v is a
    good point to split
  • Goodness score of a splitpoint v: SUM(start_v,
    end_v), the number of workload query ranges
    that start at v plus the number that end at v
  • Heuristic: choose the (m-1) best splitpoints
    (based on goodness scores) that are necessary
  • The above algorithm produces cost-optimal
    partitionings in some special cases and good
    partitionings in general
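
A minimal sketch of the goodness-score heuristic follows. It counts how many workload query ranges start or end at each candidate value and keeps the (m-1) highest-scoring points strictly inside the range. The check for whether a splitpoint is "necessary" is not described in this transcript and is left out, and the function name and tie-breaking rule (smaller value first) are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def best_splitpoints(query_ranges: List[Tuple[float, float]],
                     vmin: float, vmax: float, m: int) -> List[float]:
    """Pick up to m-1 splitpoints with the highest goodness scores."""
    score: Counter = Counter()
    for start, end in query_ranges:
        for v in (start, end):
            # Only points strictly inside the range can split it.
            if vmin < v < vmax:
                score[v] += 1  # contributes to start_v or end_v
    # Highest score first; ties broken by the smaller value.
    ranked = sorted(score, key=lambda v: (-score[v], v))
    return sorted(ranked[: m - 1])


price_ranges = [(200, 250), (200, 225), (225, 300), (250, 300), (225, 275)]
splits = best_splitpoints(price_ranges, vmin=200, vmax=300, m=3)
print(splits)  # [225, 250]
```

Here 225 scores 3, 250 scores 2 and 275 scores 1, so three buckets are formed by splitting at 225 and 250.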

Multilevel Categorization
  • At each level l:
  • Determine the categorizing attribute A
  • For each category C in level (l-1) with
    |tset(C)| > M, partition C using A
  • Detailed algorithm in paper
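
The level-by-level loop can be sketched as below. This is a simplification, not the detailed algorithm from the paper: the categorizing attribute for each level is assumed to be supplied in advance as a list (rather than chosen by cost), only single-value partitioning is used, and the dict-based node layout is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_tree(tuples: List[Dict[str, object]],
               level_attributes: List[str], m: int) -> dict:
    """Partition level by level; only categories with more than m tuples are split."""
    root = {"label": "ALL", "tuples": tuples, "children": []}
    previous_level = [root]
    for attr in level_attributes:          # one categorizing attribute per level
        current_level = []
        for node in previous_level:
            if len(node["tuples"]) <= m:   # small enough: leave as a leaf
                continue
            groups = defaultdict(list)
            for t in node["tuples"]:
                groups[t[attr]].append(t)  # single-value partitioning
            for value, members in groups.items():
                child = {"label": f"{attr}: {value}", "tuples": members,
                         "children": []}
                node["children"].append(child)
                current_level.append(child)
        previous_level = current_level
    return root


homes = [
    {"neighborhood": "Seattle", "bedrooms": 2},
    {"neighborhood": "Seattle", "bedrooms": 3},
    {"neighborhood": "Seattle", "bedrooms": 3},
    {"neighborhood": "Redmond", "bedrooms": 2},
    {"neighborhood": "Redmond", "bedrooms": 4},
]
tree = build_tree(homes, ["neighborhood", "bedrooms"], m=2)
```

With M = 2, the Seattle category (3 tuples) is split again by bedrooms, while Redmond (2 tuples) stays a leaf.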

Experiments
  • Goals:
  • Evaluate the accuracy of our cost models
  • Evaluate our cost-based categorization algorithm
    and compare it with techniques that do not
    consider such cost models
  • Dataset:
  • Home listing database (1.7 million homes) from
    MSN House&Home
  • Attributes: neighborhood, price, bedroomcount,
    bathcount, year-built, property-type, ...
  • Workload:
  • 176,262 SQL query strings representing searches
    conducted by real buyers on the MSN House&Home
    web site
  • Two studies:
  • Large-scale, simulated, cross-validated user
    study
  • Real-life user study
  • Compare our cost-based algorithm to the No Cost
    and Attr-cost techniques

Large-scale Simulated User Study
  • Treat a workload query as a user exploration
  • Broaden the exploration to generate the user
    query for which the tree is generated
  • 8-fold cross validation (each subset has 100
  • Strong positive correlation between actual cost
    and estimated average cost
  • Cost-based techniques about 5 times better than
    other techniques

[Figure: cost of various techniques]

Real-life User Study
  • 11 users (FTEs and summer interns of our group,
    August 2003)
  • 4 search tasks, 3 categorization techniques,
    cross-balanced across users (each user did a task
    only once, using one of the categorization
    techniques)
  • Users need to examine 5-10 items per relevant
    tuple found
[Figure: average normalized cost (items examined by
user per relevant tuple found) of various
techniques]

  • Proposed automatically generating a labeled,
    hierarchical category structure to reduce
    information overload
  • Developed analytical models to quantify
    information overload
  • The models accurately capture information
    overload in real life (90% correlation)
  • Based on those models, developed algorithms to
    generate the tree that minimizes information
    overload
  • Our cost-based technique outperforms other
    techniques by 1-2 orders of magnitude
  • Significantly reduces the problem of information
    overload

Thank you!