Automatic Categorization of Query Results - PowerPoint PPT Presentation

About This Presentation
Title:

Automatic Categorization of Query Results

Description:

Example: 'Neighborhood: Redmond, Bellevue' and 'Price: 200k - 225k' ... Example: Neighborhood IN {'Neighborhood:Redmond', 'Neighborhood:Bellevue', ...} – PowerPoint PPT presentation

Slides: 44
Provided by: Arj12
Learn more at: https://ranger.uta.edu

Transcript and Presenter's Notes

Title: Automatic Categorization of Query Results


1
Automatic Categorization of Query Results
  • A paper by Kaushik Chakrabarti, Surajit
    Chaudhuri, and Seung-won Hwang
  • Presented by Arjun Saraswat

2
Flow of the Presentation
  • 1. Introduction
  • 2. Motivation
  • 3. Basics of Model
  • 4. Cost Estimation
  • 5. Algorithms
  • 6. Experimental Evaluation
  • 7. Conclusion

3
INTRODUCTION
4
Introduction
  • This paper addresses the "too many answers"
    problem.
  • This phenomenon of too many answers is often
    referred to as information overload.
  • Information overload happens when the user is not
    certain what she is looking for; in such
    situations the user generally fires a broad query
    in order to avoid excluding potentially
    interesting results.
  • There are two techniques for handling information
    overload, categorization and ranking; this paper
    focuses on the categorization technique.

5
MOTIVATION
6
Motivation
  • Example: A user fires a query on the MSN House &
    Home database with the following specifications:
  • Area: Seattle/Bellevue area of Washington, USA
  • Price range: $200,000 to $300,000
  • The query returns 6,045 results; it is hard for
    the user to separate the interesting ones from
    the uninteresting ones, which wastes much of the
    user's time and effort.
  • This problem is solved by the categorization
    technique introduced in this paper: such queries
    are answered with a hierarchical category
    structure based on the contents of the answer
    set.
  • The main motive is to reduce the information
    overload.

7
Motivation
Fig. 1: Structured hierarchical categorization of
the example query's results
8
Basics of Model
9
Basics of Model
  • R: a set of tuples; it can be a base relation, a
    materialized view, or the result of a query Q.
  • Q: an SPJ (select-project-join) query.
  • A hierarchical categorization of R is a recursive
    partitioning of the tuples in R based on the data
    attributes and their values, as shown in Fig. 1.
  • Base case: the root (level 0) contains all the
    tuples in R; this tuple set is partitioned into
    mutually disjoint categories using a single
    attribute.
  • Inductive step: at a given node C at level (l-1),
    the set of tuples tset(C) contained in C is
    partitioned into ordered, mutually disjoint
    subcategories (level-l nodes) using an attribute
    that is the same for all nodes at level (l-1).

10
Basics of Model
  • A node C is partitioned only if it contains more
    than a certain number of tuples; the attribute on
    which it is partitioned is called the
    categorizing attribute of level l and the
    subcategorizing attribute of level (l-1).
  • An attribute used once is not used again at later
    levels.
  • Category label: the predicate label(C) describing
    node C.
  • Example: 'Neighborhood: Redmond, Bellevue' and
    'Price: 200k - 225k'.
  • Tuple-set tset(C): the set of tuples contained in
    C, occurring either directly under C or under its
    subcategories.
  • Example: the tset of the category with label
    'Neighborhood: Seattle' is the set of all homes
    in R that are located in Seattle.

11
Basics of Model
  • Important points for each level:
  • Determine the categorizing attribute for that
    level.
  • Partition the attribute's values in a way that
    minimizes the information overload on the user.
  • Exploration model
  • Two models capture the two common scenarios:
  • 1. All scenario.
  • 2. One scenario.

12
Basics of Model
  • 1. All scenario: the model of exploration of the
    subtree rooted at an arbitrary node C

  EXPLORE C:
    if C is a non-leaf node:
      CHOOSE one of the following:
        (1) Examine all tuples in tset(C)      // option SHOWTUPLES
        (2) for (i = 1; i <= n; i++):          // option SHOWCAT
              Examine the label of subcategory Ci
              CHOOSE one of the following:
                (2.1) EXPLORE Ci
                (2.2) Ignore Ci
    else:  // C is a leaf node
      Examine all tuples in tset(C)            // SHOWTUPLES is the only option

13
Basics of Model
  • 2. One scenario

  EXPLORE C:
    if C is a non-leaf node:
      CHOOSE one of the following:
        (1) Examine the tuples in tset(C) from the
            beginning until the first relevant tuple
            is found                           // option SHOWTUPLES
        (2) for (i = 1; i <= n; i++):          // option SHOWCAT
              Examine the label of the ith subcategory Ci
              CHOOSE one of the following:
                (2.1) EXPLORE Ci
                (2.2) Ignore Ci
              if (choice == EXPLORE) break     // examine until first relevant tuple
    else:  // C is a leaf node
      Examine the tuples in tset(C) from the beginning
      until the first relevant tuple is found  // SHOWTUPLES is the only option

14
Cost Estimation
15
Cost Estimation
  • Cost model for the All scenario
  • CostAll(X, T): the information overload cost, or
    simply the cost.
  • X: a given user exploration.
  • T: a category tree.
  • We want to generate the tree that minimizes the
    number of items this particular user needs to
    examine.
  • We use the aggregate knowledge of previous user
    behavior to estimate the information overload
    cost CostAll(T) that a user will face, on
    average, during an exploration using a given
    category tree T.

16
Cost Estimation
  • Exploration probability: the probability P(C)
    that a user exploring T explores category C,
    using either SHOWTUPLES or SHOWCAT, upon
    examining its label.
  • SHOWTUPLES probability: the probability Pw(C)
    that the user chooses option SHOWTUPLES for
    category C, given that she explores C;
    (1 - Pw(C)) is the SHOWCAT probability of C.
  • Cost model for the All scenario
  • Consider a non-leaf node C of T.
  • CostAll(Tc): the cost of exploring the subtree Tc
    rooted at C; we write CostAll(C) for CostAll(Tc),
    since the cost is always calculated in the
    context of the given tree.

17
Cost Estimation
  • If SHOWTUPLES is chosen for C, the cost is
    Pw(C)·|tset(C)|.
  • If SHOWCAT is chosen for C, the cost has two
    components.
  • Cost of the first component: K·n, where K is the
    cost of examining a category label relative to
    the cost of examining a data tuple and n is the
    number of subcategories.
  • Cost of the second component: CostAll(Ci) if she
    chooses to explore Ci, 0 if she chooses to
    ignore it.
  • CostAll(C) = Pw(C)·|tset(C)| + (1 - Pw(C))·(K·n +
    Σi P(Ci)·CostAll(Ci))    (1)
  • If C is a leaf node, CostAll(C) = |tset(C)|.
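Equation (1) translates directly into a short recursion. Below is a minimal Python sketch; the Category record, its field names, and the value K = 0.2 are illustrative assumptions, not from the paper:

```python
# A minimal sketch of equation (1); the Category record and K = 0.2 are
# illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

K = 0.2  # assumed cost of examining a label relative to examining a tuple

@dataclass
class Category:
    tset_size: int                 # |tset(C)|
    p: float = 1.0                 # P(C): exploration probability
    p_w: float = 0.5               # Pw(C): SHOWTUPLES probability
    children: List["Category"] = field(default_factory=list)

def cost_all(c: Category) -> float:
    """CostAll(C) for the All scenario, equation (1)."""
    if not c.children:             # leaf: SHOWTUPLES is the only option
        return c.tset_size
    n = len(c.children)
    showcat = K * n + sum(ci.p * cost_all(ci) for ci in c.children)
    return c.p_w * c.tset_size + (1 - c.p_w) * showcat
```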

18
Cost Estimation
  • Cost model for the One scenario
  • CostOne(T): the information overload cost, where
    frac(C) denotes the expected fraction of tset(C)
    examined before the first relevant tuple is
    found.
  • Cost for the SHOWTUPLES option:
    Pw(C)·frac(C)·|tset(C)|
  • Cost for the SHOWCAT option:
    (1 - Pw(C))·Σi (Prob. that Ci is the first
    category explored)·(K·i + CostOne(Ci))
  • Total cost for the One scenario:
  • CostOne(C) = Pw(C)·frac(C)·|tset(C)| +
    (1 - Pw(C))·Σi (Prob. that Ci is the first
    category explored)·(K·i + CostOne(Ci))

19
Cost Estimation
  • Prob. that Ci is the first category explored:
    the probability that the user explores Ci but
    none of C1 to C(i-1), which is
    [Π_{j=1..i-1} (1 - P(Cj))]·P(Ci)
  • Final CostOne term:
  • CostOne(C) = Pw(C)·frac(C)·|tset(C)| +
    (1 - Pw(C))·Σ_{i=1..n} [Π_{j=1..i-1}
    (1 - P(Cj))]·P(Ci)·(K·i + CostOne(Ci))    (2)
  • If C is a leaf node, CostOne(C) =
    frac(C)·|tset(C)|.
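Equation (2) can likewise be sketched as a recursion; as before, the Category record, its field names, and K = 0.2 are illustrative assumptions:

```python
# A minimal sketch of equation (2); the Category record and K = 0.2 are
# illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

K = 0.2

@dataclass
class Category:
    tset_size: int                 # |tset(C)|
    frac: float = 1.0              # frac(C)
    p: float = 1.0                 # P(C)
    p_w: float = 0.5               # Pw(C)
    children: List["Category"] = field(default_factory=list)

def cost_one(c: Category) -> float:
    """CostOne(C) for the One scenario, equation (2)."""
    if not c.children:
        return c.frac * c.tset_size
    total = 0.0
    none_before = 1.0              # running product of (1 - P(Cj)) for j < i
    for i, ci in enumerate(c.children, start=1):
        # Ci is the first category explored with probability
        # none_before * P(Ci); reaching it costs K*i label examinations
        total += none_before * ci.p * (K * i + cost_one(ci))
        none_before *= 1 - ci.p
    return c.p_w * c.frac * c.tset_size + (1 - c.p_w) * total
```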

20
Cost Estimation
  • Using the workload to estimate probabilities
  • P(C) and Pw(C) are needed for CostOne(T) and
    CostAll(T).
  • We use the aggregate knowledge of previous user
    behavior to estimate these probabilities
    automatically.
  • Computing the SHOWTUPLES probability
  • When a user explores a non-leaf node C, there are
    two choices: SHOWCAT or SHOWTUPLES.
  • SA(C): the subcategorizing attribute of C.

21
Cost Estimation
  • SHOWCAT probability
  • Wi: a workload query.
  • Ui: the user who issued Wi.
  • If Ui has specified a selection condition on
    SA(C) in Wi, the condition on SA(C) means the
    user is interested in only a few values; if there
    is no condition, the user is interested in all
    values of SA(C).

22
Cost Estimation
  • NAttr(A): the number of queries in the workload
    that contain a selection condition on attribute
    A; N: the total number of queries in the
    workload.
  • NAttr(SA(C))/N: the fraction of users interested
    in only a few values of SA(C).
  • SHOWCAT probability of C = NAttr(SA(C))/N
  • SHOWTUPLES probability of C = Pw(C) =
    1 - NAttr(SA(C))/N
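This estimate is easy to compute once the workload is preprocessed. A minimal sketch, assuming each workload query is represented simply as the set of attributes it has selection conditions on (a hypothetical representation):

```python
def showtuples_probability(workload, attr):
    """Pw(C) = 1 - NAttr(SA(C))/N, estimated from the workload."""
    n = len(workload)
    n_attr = sum(1 for conds in workload if attr in conds)
    return 1 - n_attr / n

# 2 of the 4 queries below constrain 'Neighborhood', so Pw = 0.5
workload = [{"Price", "Neighborhood"}, {"Price"},
            {"Bedrooms"}, {"Neighborhood"}]
```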

23
Cost Estimation
  • Computing the exploration probability P(C)
  • P(C): the probability that the user explores
    category C.
  • P(C) = P(user explores C | user examines the
    label of C)
  • P(C) = P(user explores C) / P(user examines the
    label of C)
  • The user examines the label of C only if she
    explores C's parent C' and chooses SHOWCAT for
    C', so:
  • P(C) = P(user explores C) / P(user explores C'
    and chooses SHOWCAT for C')
    = P(user explores C) / [P(user explores C') ·
    P(user chooses SHOWCAT for C' | user explores
    C')]
  • P(user chooses SHOWCAT for C' | user explores C')
    is the SHOWCAT probability of C', i.e.,
    NAttr(SA(C'))/N.


24
Cost Estimation
  • A user explores C if, upon examining the label of
    C, she thinks there may be one or more tuples in
    tset(C) that are of interest to her.
  • P(user explores C) / P(user explores C') is
    simply the probability that the user is
    interested in the predicate label(C).
  • So:
  • P(C) = P(user interested in predicate label(C)) /
    (NAttr(SA(C'))/N)
  • Here SA(C'), the subcategorizing attribute of C's
    parent, is the same as the categorizing attribute
    CA(C) of C.

25
Cost Estimation
  • CA(C): the categorizing attribute of C.
  • If a query Ui's selection condition on CA(C)
    overlaps with the predicate label(C), it means
    that Ui is interested in the predicate label(C).
  • NOverlap(C): the number of queries in the
    workload whose selection condition on CA(C)
    overlaps with label(C).
  • P(user interested in predicate label(C)) =
    NOverlap(C)/N
  • So now we get:
  • P(C) = NOverlap(C) / NAttr(CA(C))
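For a numeric categorizing attribute, the overlap test is an interval intersection. A minimal sketch, assuming the workload queries with a condition on CA(C) are given as (low, high) ranges (a hypothetical representation):

```python
def p_explore(workload_ranges, label_range):
    """P(C) = NOverlap(C) / NAttr(CA(C)) for a numeric label range."""
    lo, hi = label_range
    n_attr = len(workload_ranges)       # queries with a condition on CA(C)
    n_overlap = sum(1 for qlo, qhi in workload_ranges
                    if qlo <= hi and lo <= qhi)   # interval overlap test
    return n_overlap / n_attr
```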

26
Algorithms
27
Algorithms
  • Now that we can calculate the information
    overload cost of a given tree, we could enumerate
    all possible category trees on R and choose the
    one with minimum cost. This would give the
    cost-optimal tree but is expensive, in the sense
    that the number of possible categorization trees
    is very large.
  • To solve this problem we need to:
  • Eliminate a subset of relatively unattractive
    attributes without considering any of their
    partitionings.
  • For every attribute selected above, obtain a good
    partitioning efficiently instead of enumerating
    all possible partitionings.

28
Algorithms
  • Reducing the choices of categorizing attribute
  • 1) Eliminate the uninteresting attributes using
    the following simple heuristic: if an attribute A
    occurs in less than a fraction x of the queries
    in the workload, i.e., NAttr(A)/N < x, we
    eliminate A. The threshold x must be specified by
    the system designer or domain expert.
  • 2) To support attribute elimination, we
    preprocess the workload and maintain, for each
    potential categorizing attribute A, the number
    NAttr(A) of queries in the workload that contain
    a selection condition on A.

29
Algorithms
  • Partitioning for categorical attributes
  • In this paper only single-value partitionings of
    R are considered.
  • Consider the case where the user query Q contains
    a selection condition of the form A IN {v1, ...,
    vk} on A.
  • Example: Neighborhood IN
    {'Neighborhood:Redmond',
    'Neighborhood:Bellevue', ...}
  • Among the single-value partitionings, we want to
    choose the one with the minimum cost.
  • Since the set of categories is identical in all
    possible single-value partitionings, the only
    factor that affects the cost of a single-value
    partitioning is the order in which the categories
    are presented to the user.

30
Algorithms
  • CostAll(T) is not affected by the ordering, so we
    consider only CostOne(T); CostOne(T) is minimized
    when the categories are presented in increasing
    order of 1/(P(Ci)·CostOne(Ci)).
  • Heuristic: present the categories in decreasing
    order of P(Ci).
  • P(Ci) = NOverlap(Ci)/NAttr(A); since Ci
    corresponds to a single value vi, NOverlap(Ci) is
    the number of queries in the workload whose
    selection condition on A contains vi in the IN
    clause, denoted occ(vi).
  • To obtain the partitioning we simply sort the
    values in the IN clause in decreasing order of
    occ(vi).
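The resulting heuristic can be sketched in a few lines; representing the workload as a list of IN clauses (each a list of values) is an assumption for illustration:

```python
# Sketch of the single-value partitioning heuristic: order the IN-clause
# values by occ(vi), descending. The workload representation is assumed.
from collections import Counter

def order_single_value_categories(in_values, workload_in_clauses):
    """Present single-value categories in decreasing order of occ(vi)."""
    occ = Counter(v for clause in workload_in_clauses for v in clause)
    return sorted(in_values, key=lambda v: occ[v], reverse=True)
```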

31
Algorithms
32
Algorithms
  • Partitioning for numeric attributes
  • Let Vmin and Vmax be the minimum and maximum
    values that the tuples in R take in attribute A.
  • Consider a point v (Vmin < v < Vmax). If a
    significant number of query ranges in the
    workload begin or end at v, it is a good point to
    split, as the workload suggests that most users
    would be interested in just one bucket.
  • If none of the query ranges begin or end at v,
    then v is not a good point to split. If we
    partition the range into m buckets, then (m-1)
    points where queries begin or end, called split
    points, should be selected.
  • The split points are not the only factor
    determining the cost; the other factor is the
    number of tuples in each bucket.
  • This heuristic therefore does not always give the
    best partitioning in terms of cost.

33
Algorithms
  • Consider the point v again (Vmin < v < Vmax). Let
    startV and endV denote the number of query ranges
    in the workload starting and ending at v,
    respectively. We use SUM(startV, endV) as the
    goodness score of the point v.
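The split-point selection described above can be sketched as follows, assuming the workload query ranges are given as (low, high) pairs:

```python
# Sketch of split-point selection for a numeric attribute: score each
# candidate point by startV + endV and keep the (m - 1) best interior points.
from collections import Counter

def split_points(query_ranges, m, vmin, vmax):
    """Pick (m - 1) interior points with the highest startV + endV score."""
    score = Counter()
    for lo, hi in query_ranges:
        score[lo] += 1                  # a query range starts at lo
        score[hi] += 1                  # a query range ends at hi
    interior = [v for v in score if vmin < v < vmax]
    best = sorted(interior, key=lambda v: score[v], reverse=True)[:m - 1]
    return sorted(best)
```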

34
Algorithms
  • Multilevel categorization
  • ALGORITHM
  • 1. For multilevel categorization, for each level
    l we need to determine the categorizing attribute
    A and, for each category C at level (l-1),
    partition the domain of values of A in tset(C)
    such that the information overload is minimized.
  • 2. The algorithm creates the categories level by
    level: all categories at level (l-1) are created
    and added to the tree T before any category at
    level l. Let S denote the set of categories at
    level (l-1) with more than M tuples.
  • 3. For each candidate attribute A, we partition
    each category C in S using the partitioning
    methods for categorical and numeric attributes.
  • 4. We compute the cost of the
    attribute-partitioning combination for each
    candidate attribute A and select the attribute a
    with the minimum cost. For each category C in S,
    we add the partitions of C based on a to T.
  • 5. This completes the node creation at level l.
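The level-by-level loop in steps 2-5 can be sketched as follows; the Node record and the partition/cost callbacks are hypothetical stand-ins for the partitioning and cost-estimation procedures described earlier:

```python
# Sketch of one level of multilevel categorization (steps 2-5). The Node
# record and the partition/cost callbacks are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

M = 20  # minimum number of tuples for a node to be partitioned further

@dataclass
class Node:
    tset_size: int
    children: List["Node"] = field(default_factory=list)

def build_level(level_nodes, candidate_attrs, partition, cost):
    """Create all nodes of the next level (one pass of steps 2-5)."""
    S = [c for c in level_nodes if c.tset_size > M]
    if not S or not candidate_attrs:
        return []
    # step 4: choose the attribute whose partitionings minimize total cost
    best = min(candidate_attrs,
               key=lambda A: sum(cost(partition(c, A)) for c in S))
    next_level = []
    for c in S:
        c.children = partition(c, best)   # add the partitions of c to T
        next_level.extend(c.children)
    candidate_attrs.remove(best)          # an attribute is used at most once
    return next_level
```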

35
Experimental Evaluation
  • Evaluation is done on the following:
  • Evaluate the accuracy of the cost models in
    modeling information overload.
  • Evaluate the cost-based categorization algorithm
    and compare it with categorizations that do not
    consider such cost models.
  • Database: MSN House & Home
  • M = 20
  • All experiments are conducted on a Compaq Evo
    W8000 with a 1.7 GHz CPU and 768 MB RAM, running
    Windows XP.
  • Dataset for both experiments: a single table
    called ListProperty, containing 1.7 million rows.
  • The workload comprises 176,262 query strings
    representing searches conducted by home buyers on
    the MSN House & Home website.
  • In both studies the paper's cost-based technique
    is compared to two other techniques: No-Cost and
    Attr-Cost.
  • No-Cost: uses the same level-by-level
    categorization but chooses the categorizing
    attributes at each level arbitrarily (without
    replacement).

36
Experimental Evaluation
  • Attr-Cost: selects the attribute with the lowest
    cost as the categorizing attribute at each level,
    but considers only those partitionings considered
    by the No-Cost technique.
  • Simulated user study
  • Due to the difficulty of conducting a large-scale
    real-life user study, a novel way to simulate one
    is developed: a subset of 100 queries is picked
    from the workload and treated as user
    explorations; a workload query W used this way is
    referred to as a synthetic exploration.
  • Estimated (average) cost: CostAll(T)
  • Actual cost: CostAll(W, T) of the exploration
  • 8 mutually disjoint subsets of 100 synthetic
    explorations each are considered.
  • The figure shows the correlation between the
    actual cost and the estimated cost.

37
Experimental Evaluation
  • Figure on the left: cost of the various
    techniques for the 8 subsets.
  • Figure on the right: Pearson's correlation
    between the estimated cost and the actual cost.

38
Experimental Evaluation: Real-Life User Study
  • Tasks
  • 1. Any neighborhood in Seattle/Bellevue, price <
    $1 million.
  • 2. Any neighborhood in Bay Area Penin/SanJose,
    price between $300K and $500K.
  • 3. 15 selected neighborhoods in NYC
    Manhattan/Bronx, price < $1 million.
  • 4. Any neighborhood in Seattle/Bellevue, price
    between $200K and $400K, bedroom count between 3
    and 4.

39
Experimental Evaluation: Real-Life User Study
  • Figure on the left: average cost (number of items
    examined until she finds all the relevant tuples)
    of the various techniques.
  • Figure on the right: average number of relevant
    tuples found by users for the various techniques.

40
Experimental Evaluation: Real-Life User Study
  • Figure on the left: average normalized cost
    (items examined by the user per relevant tuple
    found) of the various techniques.
  • Figure on the right: average cost (until she
    finds the first relevant tuple) of the various
    techniques.

41
Experimental Evaluation: Real-Life User Study
  • Figure on the left: results of the post-study
    survey.
  • Figure on the right: average execution time of
    the cost-based categorization algorithm.

42
Conclusion
  • This paper gives a solution to the information
    overload problem by proposing the automatic
    categorization of query results. The solution is
    to dynamically generate a labeled, hierarchical
    category structure; the user can determine
    whether a category is relevant simply by
    examining its label and can explore only the
    relevant categories, thereby reducing the
    information overload.

43
Thank You