Title: Context-Aware Query Classification
1Context-Aware Query Classification
- Huanhuan Cao1, Derek Hao Hu2, Dou Shen3, Daxin Jiang4, Jian-Tao Sun4, Enhong Chen1 and Qiang Yang2
- 1University of Science and Technology of China
- 2Hong Kong University of Science and Technology
- 3Microsoft Corporation
- 4Microsoft Research Asia
2Motivation
- Understanding Web users' information needs is one of the most important problems in Web search.
- Such information can help improve the quality of many Web search services, such as
- Ranking
- Online advertising
- Query suggestion, etc.
3Challenges
- The main challenges of query classification are
- Lack of feature information
- Ambiguity
- Multiple intents
- The first problem has been studied widely
- Query expansion using top search results
- Leveraging a web directory
- However, the second and third problems are far from solved.
4Why is context useful?
- Context means the previous queries and clicked URLs in the same session as a given query.
- It is assumed that
- Context is semantically related to the current query.
- Context may help to assign appropriate categories to the current query.
- It therefore makes sense to exploit context to clarify the intent of the current query.
5Example
6Example
7Example
8Overview
- Problem statement
- Model query context by CRF
- Features of CRF
- Experiment
- Conclusion and future work
9Problem Statement: Context
- In a user search session, suppose the user has issued a series of queries $q_1 q_2 \ldots q_{T-1}$ and clicked some returned URLs $U_1 U_2 \ldots U_{T-1}$.
- If the user issues a query $q_T$ at time $T$, we call $q_1 q_2 \ldots q_{T-1}$ and $U_1 U_2 \ldots U_{T-1}$ the query context of $q_T$.
- We call $q_t$ ($t \in [1, T-1]$) the contextual queries of $q_T$.
10Query Context
Query Context of Q_T
11Problem Statement: QC with Context and Taxonomy
- The objective of query classification (QC) with context is to classify a user query $q_T$ into a ranked list of $K$ categories $c_{T1}, c_{T2}, \ldots, c_{TK}$ among $N_c$ categories $c_1, c_2, \ldots, c_{N_c}$, given the context of $q_T$.
- A target taxonomy is a tree of categories in which $c_1, c_2, \ldots, c_{N_c}$ are leaf nodes.
12Modeling Query Context by CRF
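- A CRF models the conditional distribution $p(\mathbf{c} \mid \mathbf{q})$ of the label sequence $\mathbf{c}$ given the query sequence, where $\mathbf{q}$ represents $q_1 q_2 \ldots q_T$.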
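The equation on this slide did not survive extraction. For orientation only, the textbook linear-chain CRF has the form below. It is not recovered from the slide, and the weights $\lambda_k$, feature functions $f_k$ and normalizer $Z(\mathbf{q})$ are assumed notation:

```latex
p(\mathbf{c} \mid \mathbf{q}) = \frac{1}{Z(\mathbf{q})}
  \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(c_{t-1}, c_t, \mathbf{q}, t) \right),
\qquad
Z(\mathbf{q}) = \sum_{\mathbf{c}'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(c'_{t-1}, c'_t, \mathbf{q}, t) \right)
```

Under this form, each feature function may look at the current label, the previous label and the whole query sequence, which is what lets the model combine local and contextual evidence.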
13Why CRF?
- The two main advantages of CRFs are
- 1) They can incorporate general feature functions to model the relation between observations and unobserved states.
- 2) They do not need prior knowledge of the type of conditional distribution.
- Given 1), we can incorporate external web knowledge.
- Given 2), we do not need any assumptions about the type of $p(\mathbf{c} \mid \mathbf{q})$.
14Features of CRF
- When we use a CRF to model query context, one of the most important parts is choosing effective feature functions.
- We should consider
- Relevance between queries and category labels, to leverage the local information of queries
- Relevance between adjacent labels, to leverage contextual information
15Relevance between queries and category labels
- Term occurrence
- The terms of $q_t$ are obvious features for supporting $c_t$.
- Due to the limited size of the training data, many useful terms that indicate category information may not be covered.
- General label confidence
- Leverage an external web directory such as Google Directory
- $M$ is the number of returned results and $M_{c_t,q_t}$ is the number of returned results with label $c_t$ after mapping.
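The two counts defined above suggest a simple ratio. The sketch below assumes the confidence is $M_{c_t,q_t} / M$, which the surviving slide text implies but does not state. The function name is hypothetical, and the directory results are assumed to be already mapped to target-taxonomy labels:

```python
from collections import Counter

def general_label_confidence(result_labels, category):
    """Fraction of returned directory results whose mapped label is `category`.

    Assumed form: M_{c_t,q_t} / M (not stated explicitly on the slide).
    """
    m = len(result_labels)                    # M: number of returned results
    if m == 0:
        return 0.0                            # no results, no evidence
    m_c = Counter(result_labels)[category]    # M_{c_t,q_t}: results labeled c_t
    return m_c / m

# Illustrative labels for the results returned for one query
labels = ["Travel", "Travel", "Shopping", "Travel"]
conf = general_label_confidence(labels, "Travel")
```

Because the score depends only on the query, it can still help when none of the query's terms appeared in the training data.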
16Relevance between queries and category labels
- Click-aware label confidence
- Combines click information with the knowledge of an external web directory
- CConf($c_t$, $u_t$) can be calculated by multiple approaches.
- Here, we use the vector space model (VSM) to calculate the cosine similarity between the term vectors of $c_t$ and $u_t$.
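A minimal sketch of the VSM cosine similarity follows. It assumes raw term counts and whitespace tokenization; the slides do not say how the term vectors for $c_t$ and $u_t$ are built or weighted, so the helper names and sample texts are purely illustrative:

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term counts (assumed representation)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0                      # an empty vector matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical term descriptions of a category and of a clicked URL
category_terms = term_vector("travel hotel flight guide")
url_terms = term_vector("paris travel guide hotel deals")
cconf = cosine_similarity(category_terms, url_terms)
```

A weighting scheme such as TF-IDF could replace the raw counts without changing the similarity function.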
17Relevance between Adjacent Labels
- Direct relevance between adjacent labels
- Occurrence of an adjacent label pair $\langle c_{t-1}, c_t \rangle$
- The weight indicates how likely the two labels are to co-occur.
- Taxonomy-based relevance between adjacent labels
- Limited by the sampling approach and the size of the training data, some reasonable adjacent label pairs may be under-represented or may not occur at all.
- Indirect relevance between adjacent labels is captured by considering the taxonomy.
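The difference between the two kinds of pair feature can be illustrated by counting label pairs at the leaf level and again at the parent level of the taxonomy. This is only a sketch: in a CRF the pair weights are learned rather than counted, the slides do not give the exact taxonomy-based feature, and the "Parent/Child" label format and function name are assumptions:

```python
from collections import Counter

def adjacent_pair_counts(label_sequences, parent=None):
    """Count adjacent label pairs <c_{t-1}, c_t> over labeled sessions.

    If `parent` (a child -> parent mapping) is given, labels are first
    replaced by their parent category, so sparse leaf pairs share evidence.
    """
    counts = Counter()
    for seq in label_sequences:
        if parent is not None:
            seq = [parent[c] for c in seq]
        counts.update(zip(seq, seq[1:]))   # consecutive label pairs
    return counts

# Hypothetical labeled sessions using "Parent/Child" category names
sessions = [
    ["Travel/Lodging", "Travel/Flights"],
    ["Travel/Lodging", "Travel/Guides"],
]
parent = {c: c.split("/")[0] for s in sessions for c in s}

direct = adjacent_pair_counts(sessions)             # leaf-level pairs
by_parent = adjacent_pair_counts(sessions, parent)  # parent-level pairs
```

Here the pair (Travel/Flights, Travel/Guides) never occurs directly, yet both labels fall under a parent pair (Travel, Travel) that is observed twice, which is the kind of indirect evidence the taxonomy provides.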
18Experiment
- Data set
- 10,000 randomly selected sessions from one day's search log of a commercial search engine.
- Three labelers first label all possible categories from the KDDCUP05 taxonomy for each unique query in the training data.
19Examples of multiple category queries
The large proportion of multiple-category queries shows how difficult QC is without context.
20Label Sessions
- The three human labelers are then asked to cross-label each session of the data set with a sequence of level-2 category labels.
- For each query, a labeler gives the most appropriate category label by considering
- The query itself
- The query context
- The clicked URLs of the query
21Tested Approaches
- Baselines
- Non-context-aware baseline: Bridging Classifier (BC), proposed by Shen et al.
- Naïve context-aware baseline: Collaborating Classifier (CC), which combines a test query with the previous query and classifies the result with BC.
- CRFs
- CRF-B: CRF with basic features (term occurrence, general label confidence, and direct relevance between adjacent labels)
- CRF-B-C: CRF with basic features + click-aware label confidence
- CRF-B-C-T: CRF with basic features + click-aware label confidence + taxonomy-based relevance
22Evaluation Metrics
- Given a test session $q_1 q_2 \ldots q_T$, we let $q_T$ be the test query and let the queries $q_1 q_2 \ldots q_{T-1}$ and the corresponding clicked URL sets $U_1 U_2 \ldots U_{T-1}$ be the query context.
- For $q_T$, we evaluate a tested approach by
- Precision: $P = \delta(c_T \in C_{T,K}) / K$
- Recall: $R = \delta(c_T \in C_{T,K})$
- F1 score: $F_1 = 2PR / (P + R)$
- Here $c_T$ is the ground-truth label and $C_{T,K}$ is the set of the top $K$ labels. $\delta(\cdot)$ is a Boolean function indicating whether its argument is true (1) or false (0).
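The three metrics can be computed per test query as in the sketch below. The function name and sample labels are illustrative, and $F_1$ is taken to be 0 when the true label is missed, since the formula is otherwise undefined for $P = R = 0$:

```python
def evaluate_top_k(true_label, ranked_labels, k):
    """Precision, recall and F1 for one test query with one true label."""
    hit = 1 if true_label in ranked_labels[:k] else 0   # delta(c_T in C_{T,K})
    precision = hit / k
    recall = float(hit)
    # Convention (assumption): F1 = 0 when the true label is not in the top K
    f1 = 2 * precision * recall / (precision + recall) if hit else 0.0
    return precision, recall, f1

# The true label appears at rank 2, so it is covered by the top 2
p, r, f1 = evaluate_top_k("Travel", ["Shopping", "Travel", "Sports"], k=2)
```

With a single ground-truth label, recall is either 0 or 1, and precision falls as $K$ grows even when the label is found.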
23Overall results
1) The naïve context-aware baseline consistently outperforms the non-context-aware baseline.
2) CRFs consistently outperform the two baselines.
3) CRF-B-C-T > CRF-B-C > CRF-B: click information and taxonomy-based relevance are useful.
24Case study
Context about travel
Click a travel guide web page
Give the most appropriate label in the first
position
25Efficiency of Our Approach
- Offline training
- Each iteration takes about 300ms
- Time cost of training a CRF is acceptable
- Online cost
- Calculating features
- Label confidence
26Conclusion and Future work
- In this paper, we propose a novel approach to query classification that models query context via CRFs.
- Experiments on a real search log clearly show that our approach outperforms a non-context-aware baseline and a naïve context-aware baseline.
- The current approach cannot leverage contextual information for the first queries of sessions, which motivates our future work on leveraging contextual information from outside the current session.