Title: Context-Aware Query Classification
 1Context-Aware Query Classification 
- Huanhuan Cao (1), Derek Hao Hu (2), Dou Shen (3),
- Daxin Jiang (4), Jian-Tao Sun (4), Enhong Chen (1) and
 Qiang Yang (2)
- (1) University of Science and Technology of China
- (2) Hong Kong University of Science and Technology
- (3) Microsoft Corporation
- (4) Microsoft Research Asia
2Motivation
- Understanding Web users' information needs is one
 of the most important problems in Web search.
- Such information can help improve
 the quality of many Web search services, such as:
-  Ranking 
-  Online advertising 
-  Query suggestion, etc.
3Challenges
- The main challenges of query classification 
- Lack of feature information 
- Ambiguity 
- Multiple intents 
- The first problem has been studied widely:
- Query expansion by top search results
- Leveraging a web directory
- However, the second and third problems are
 far from being solved.
4Why is context useful?
- For a given query, context means the previous queries
 and clicked URLs in the same session.
- It is assumed that:
- Context has semantic relation with the current 
 query.
- Context may help to label appropriate categories 
 for current query.
- It makes sense to exploit context to disambiguate
 the current query.
5Example 
 6Example 
 7Example 
 8Overview 
- Problem statement 
- Model query context by CRF 
- Features of CRF 
- Experiment 
- Conclusion and future work
9Problem Statement: Context
- In a user search session, suppose the user has
 issued a series of queries $q_1 q_2 \ldots q_{T-1}$ and
 clicked some returned URLs $U_1 U_2 \ldots U_{T-1}$.
- If the user issues a query $q_T$ at time $T$, we call
 $q_1 q_2 \ldots q_{T-1}$ and $U_1 U_2 \ldots U_{T-1}$ the query context of $q_T$.
- We call $q_t$ ($t \in [1, T-1]$) the contextual
 queries of $q_T$.
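The definition above can be made concrete with a small data structure. This is only a minimal sketch: the `Session` class, its field names, and the sample queries and URL are illustrative assumptions, not part of the original work.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Session:
    """A search session: queries[i] is the (i+1)-th query, clicks[i] its clicked URLs."""
    queries: List[str]
    clicks: List[List[str]]

    def context_of_last_query(self) -> Tuple[List[str], List[List[str]]]:
        """Return (q_1..q_{T-1}, U_1..U_{T-1}), the query context of q_T."""
        return self.queries[:-1], self.clicks[:-1]


# Hypothetical session, for illustration only.
session = Session(
    queries=["hotels in paris", "louvre tickets", "eiffel tower"],
    clicks=[["example.com/paris-hotels"], [], []],
)
context_queries, context_clicks = session.context_of_last_query()
```

Here the last query plays the role of the current query, and everything before it forms its context.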
10Query Context
Query Context of Q_T 
11Problem Statement: QC with Context and Taxonomy
- The objective of query classification (QC) with
 context is to classify a user query $q_T$ into a
 ranked list of $K$ categories $c_{T1}, c_{T2}, \ldots, c_{TK}$,
 among $N_c$ categories $c_1, c_2, \ldots, c_{N_c}$, given the
 context of $q_T$.
- A target taxonomy is a tree of categories where
 $c_1, c_2, \ldots, c_{N_c}$ are leaf nodes of this tree.
12Modeling Query Context by CRF
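- A CRF models the conditional distribution $p(\mathbf{c} \mid \mathbf{q})$
 of the label sequence $\mathbf{c}$ given the queries,
 where $\mathbf{q}$ represents $q_1 q_2 \ldots q_T$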
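For reference, a linear-chain CRF is commonly written in the form below. The symbols $f_k$ (feature functions), $\lambda_k$ (their weights) and $Z(\mathbf{q})$ (the normalizer) follow the standard textbook formulation and are assumptions here; the exact parameterization used in this work may differ.

```latex
p(\mathbf{c} \mid \mathbf{q})
  = \frac{1}{Z(\mathbf{q})}
    \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(c_{t-1}, c_t, \mathbf{q}, t) \right),
\qquad
Z(\mathbf{q})
  = \sum_{\mathbf{c}'}
    \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(c'_{t-1}, c'_t, \mathbf{q}, t) \right)
```

Feature functions that depend only on $c_t$ and $\mathbf{q}$ relate queries to labels, while those that depend on $(c_{t-1}, c_t)$ relate adjacent labels.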
13Why CRF?
- The two main advantages of CRF are 
- 1) It can incorporate general feature functions 
 to model the relation between observations and
 unobserved states
- 2) It doesn't need prior knowledge of the type of 
 conditional distribution.
- Given 1), we can incorporate some external web 
 knowledge.
- Given 2), we don't need any assumptions about the
 form of $p(\mathbf{c} \mid \mathbf{q})$.
14Features of CRF
- When we use a CRF to model query context, one of
 the most important parts is choosing effective
 feature functions.
- We should consider 
-  Relevance between queries and category labels 
 for leveraging local information of queries
- Relevance between adjacent labels for leveraging 
 contextual information.
15Relevance between queries and category labels
- Term occurrence 
- The terms of $q_t$ are obvious features for
 supporting $c_t$
- Due to the limited size of the training data, many
 useful terms indicating category information may
 not be covered.
- General label confidence 
-  Leverage an external web directory such as 
 Google Directory
-  where $M$ means the number of returned results
 and $M_{c_t,q_t}$ means the
 number of returned results with label $c_t$ after
 mapping.
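One plausible reading of the general label confidence is the fraction of returned directory results that map to the candidate label. The sketch below assumes that ratio form; the function name `general_label_confidence` and the sample labels are hypothetical.

```python
from typing import Sequence


def general_label_confidence(mapped_labels: Sequence[str], label: str) -> float:
    """Fraction of the M returned results whose mapped label equals `label`.

    Assumes the confidence is the ratio M_{c_t,q_t} / M.
    Returns 0.0 when no results are returned.
    """
    m = len(mapped_labels)
    if m == 0:
        return 0.0
    m_label = sum(1 for mapped in mapped_labels if mapped == label)
    return m_label / m


# Hypothetical mapped labels of the directory results returned for one query.
results = ["Travel", "Travel", "Shopping", "Travel"]
confidence = general_label_confidence(results, "Travel")
```

Because the score depends only on the query and the external directory, it remains available for queries whose terms never appear in the training data.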
16Relevance between queries and category labels
- Click-aware label confidence 
- Combining the click information with the
 knowledge of an external web directory
-  
 
- CConf($c_t$, $u_t$) can be calculated by multiple
 approaches.
- Here, we use the vector space model (VSM) to calculate the cosine similarity
 between the term vectors of $c_t$ and $u_t$
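The cosine similarity between two term vectors can be sketched as follows. The slides do not say how terms are weighted, so raw term frequencies are assumed here; `term_vector`, `cosine_similarity`, and the sample texts are illustrative names and data.

```python
import math
from collections import Counter
from typing import Dict


def term_vector(text: str) -> Dict[str, int]:
    """Build a raw term-frequency vector from lowercased, whitespace-separated terms."""
    return Counter(text.lower().split())


def cosine_similarity(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical term descriptions of a category label and a clicked URL.
label_vec = term_vector("travel hotel flight")
url_vec = term_vector("paris travel guide hotel")
cconf = cosine_similarity(label_vec, url_vec)
```

A weighting scheme such as TF-IDF could be substituted in `term_vector` without changing the similarity function.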
17Relevance between Adjacent Labels
- Direct relevance between adjacent labels 
- Occurrence of adjacent label pair $\langle c_{t-1}, c_t \rangle$
- The weight implies how likely the two labels 
 co-occur
- Taxonomy based relevance between adjacent labels 
- Because of the sampling approach and the limited size of the
 training data, some reasonable adjacent label
 pairs may be under-represented or may not
 occur at all.
- Consider indirect relevance between adjacent 
 labels by considering the taxonomy.
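The idea of falling back on the taxonomy when a leaf-level label pair is unseen can be sketched as below. This is not the exact feature definition used in this work: it assumes labels are slash-separated taxonomy paths and simply counts adjacent pairs at both the leaf level and the parent level. The helper names and sample labels are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def parent(label: str) -> str:
    """Return the parent category of a path-style label such as 'Computers/Hardware'."""
    return label.rsplit("/", 1)[0]


def adjacent_pair_counts(sessions: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count adjacent label pairs at the leaf level and at the parent level."""
    direct: Counter = Counter()
    by_parent: Counter = Counter()
    for labels in sessions:
        for prev, cur in zip(labels, labels[1:]):
            direct[(prev, cur)] += 1
            by_parent[(parent(prev), parent(cur))] += 1
    return direct, by_parent


# Hypothetical labeled sessions (one label per query).
sessions = [
    ["Computers/Hardware", "Computers/Software"],
    ["Computers/Software", "Computers/Security"],
]
direct, by_parent = adjacent_pair_counts(sessions)
```

In this toy data the pair (Computers/Hardware, Computers/Security) never occurs directly, yet the parent-level pair (Computers, Computers) is observed twice, which is the kind of indirect evidence the taxonomy provides.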
18Experiment
- Data set 
- 10,000 randomly selected sessions from one day's
 search log of a commercial search engine.
- Three labelers first label all possible
 categories from the KDDCUP05 taxonomy for each
 unique query of the training data.
19Examples of multiple category queries
The large proportion of multiple-category queries
illustrates the difficulty of QC without context.
 20Label Sessions
- The three human labelers are then asked to cross-label
 each session of the data set with a
 sequence of level-2 category labels.
- For each query, a labeler gives the most
 appropriate category label by considering:
- Query itself 
-  The query context 
-  Clicked URLs of the query. 
21Tested Approaches
- Baselines 
- Non-context-aware baseline: Bridging
 classifier (BC) proposed by Shen et al.
- Naïve context-aware baseline: Collaborating
 classifier (CC), which combines a test query with the
 previous query and classifies the result with BC.
- CRFs 
- CRF-B: CRF with basic features (term
 occurrence, general label confidence and direct
 relevance between adjacent labels)
- CRF-B-C: CRF with basic features + click-aware
 label confidence
- CRF-B-C-T: CRF with basic features + click-aware
 label confidence + taxonomy-based relevance
22Evaluation Metrics
- Given a test session $q_1 q_2 \ldots q_T$, we let $q_T$ be
 the test query and let queries $q_1 q_2 \ldots q_{T-1}$ and
 the corresponding clicked URL sets $U_1 U_2 \ldots U_{T-1}$ be the
 query context.
- For $q_T$, we evaluate a tested approach by:
- Precision: $P = \delta(c_T \in C_{T,K}) / K$
- Recall: $R = \delta(c_T \in C_{T,K})$
- F1 score: $F_1 = 2PR / (P + R)$
- Here $c_T$ means the ground truth label and $C_{T,K}$
 means the set of the top $K$ labels. $\delta(\cdot)$ is a
 Boolean function indicating whether its argument is true
 (1) or false (0).
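The three metrics can be computed for a single test query as in the sketch below. The function name `precision_recall_f1` and the sample labels are illustrative, and F1 is taken to be 0 when the ground truth label is missing from the top K, a convention assumed here to avoid dividing by zero.

```python
from typing import List, Tuple


def precision_recall_f1(truth: str, ranked: List[str], k: int) -> Tuple[float, float, float]:
    """Compute P, R and F1 at cutoff k (k >= 1) for one test query with one true label."""
    top_k = ranked[:k]
    hit = 1.0 if truth in top_k else 0.0  # the indicator delta(c_T in C_{T,K})
    precision = hit / k
    recall = hit
    f1 = 0.0 if hit == 0.0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical ranked output for one test query.
p, r, f1 = precision_recall_f1("Travel", ["Shopping", "Travel", "News"], k=3)
```

With a single ground truth label, recall is either 0 or 1, and precision can be at most 1/K, so a larger K trades precision for a better chance of covering the true label.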
23Overall results
1) The naïve context-aware baseline consistently outperforms the non-context-aware baseline.
2) CRFs consistently outperform the two baselines.
3) CRF-B-C-T > CRF-B-C > CRF-B: click information and taxonomy-based relevance are useful.
 24Case study
Context about travel
Click a travel guide web page
Give the most appropriate label in the first 
position 
 25Efficiency of Our Approach
- Offline training 
- Each iteration takes about 300ms 
- Time cost of training a CRF is acceptable 
- Online cost 
- Calculating features 
- Label confidence
26Conclusion and Future work
- In this paper, we propose a novel approach for 
 query classification by modeling query context
 via CRFs.
- Experiments on a real search log clearly show
 that our approach outperforms both a non-context-aware
 baseline and a naive context-aware baseline.
- The current approach cannot leverage contextual
 information for the first queries of sessions,
 which motivates our future research
 on leveraging contextual information from beyond
 the current session.