Title: Predicting Diverse Subsets Using Structural SVMs
1. Predicting Diverse Subsets Using Structural SVMs
- ICML 2008
- Yisong Yue (Cornell University)
- In collaboration with Thorsten Joachims (Cornell University)
2. Query: "Jaguar"
[Screenshot of search results from 11/27/2007: top of first page, bottom of first page, and result 18]
3. Need for Diversity (in IR)
- Ambiguous queries
  - Users with different information needs issuing the same textual query (e.g., "Jaguar")
  - At least one relevant result for each information need
- Learning queries (Swaminathan et al., 2008)
  - User interested in a specific detail or the entire breadth of knowledge available
  - Results with high information diversity
4. Learning to Rank
- Current methods
  - Real-valued retrieval functions f(q,d)
  - Sort by f(q,d_i) to obtain a ranking
- Benefits
  - Know how to perform learning
  - Can optimize for rank-based performance measures
  - Outperforms traditional IR models
- Drawbacks
  - Cannot account for diversity
  - During prediction, considers each document independently
5. Optimizing Diversity
- Interest in information retrieval
  - Carbonell & Goldstein, 1998; Zhai et al., 2003; Zhang et al., 2005; Chen & Karger, 2006; Swaminathan et al., 2008
- Requires inter-document dependencies
  - Impossible given current learning-to-rank methods
- No consensus on how to measure diversity
6. Contribution
- Formulate as predicting diverse subsets
- Use training data with explicitly labeled subtopics (TREC 6-8 Interactive Track)
- Use a loss function to encode subtopic loss
- Perform training using structural SVMs (Tsochantaridis et al., 2005)
7. Representing Diversity
- Current datasets use manually determined subtopic labels
  - E.g., "Use of robots in the world today"
    - Nanorobots
    - Space mission robots
    - Underwater robots
  - Manual partitioning of the total information regarding a query
  - Relatively reliable
8. Example
- Choose K documents with maximal information coverage
- For K = 3, the optimal set is {D1, D2, D10}
9. Maximizing Subtopic Coverage
- Goal: select K documents which collectively cover as many subtopics as possible
- Exact selection requires examining all (n choose K) subsets
- Essentially a set cover problem
  - Greedy selection gives a (1 - 1/e)-approximation bound (see the sketch below)
  - Special case of Max Coverage (Khuller et al., 1997)
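To make the greedy selection concrete, here is a minimal, self-contained Python sketch; the documents and subtopic labels below are illustrative only, not taken from the TREC data.

# Greedy K-document selection for subtopic coverage (sketch).
# Each document is represented by the set of subtopics it covers;
# the documents below are made up for illustration.
def greedy_subtopic_coverage(doc_subtopics, K):
    """Pick K documents that greedily maximize the number of covered subtopics."""
    selected, covered = [], set()
    remaining = dict(doc_subtopics)
    for _ in range(min(K, len(remaining))):
        # Choose the document adding the most not-yet-covered subtopics.
        best_doc = max(remaining, key=lambda d: len(remaining[d] - covered))
        selected.append(best_doc)
        covered |= remaining.pop(best_doc)
    return selected, covered

docs = {
    "D1": {"t1"},
    "D2": {"t1", "t2", "t3"},
    "D3": {"t1", "t3"},
    "D4": {"t4"},
}
print(greedy_subtopic_coverage(docs, K=2))
# (['D2', 'D4'], {'t1', 't2', 't3', 't4'})  -- set print order may vary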
10. Weighted Word Coverage
- More distinct words = more information
- Weight word importance
  - Does not depend on human labels
- Goal: select K documents which collectively cover as many distinct (weighted) words as possible
  - Greedy selection also yields a (1 - 1/e) bound
  - Need to find a good weighting function (learning problem)
12. Example

Word benefits:
  V1: 1   V2: 2   V3: 3   V4: 4   V5: 5

Document word counts:
        V1  V2  V3  V4  V5
  D1            X   X   X
  D2        X       X   X
  D3    X   X   X   X

Marginal benefit:
           D1   D2   D3   Best
  Iter 1   12   11   10   D1
  Iter 2   --    2    3   D3
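The following short Python sketch reproduces the marginal-benefit table above; the per-document word sets are the assignments implied by the benefit sums in the table.

# Greedy weighted word coverage, reproducing the worked example above.
# Word benefits come from the first table; the document word sets are
# those implied by the marginal-benefit sums.
benefit = {"V1": 1, "V2": 2, "V3": 3, "V4": 4, "V5": 5}
docs = {
    "D1": {"V3", "V4", "V5"},
    "D2": {"V2", "V4", "V5"},
    "D3": {"V1", "V2", "V3", "V4"},
}

def greedy_weighted_coverage(docs, benefit, K):
    selected, covered = [], set()
    candidates = dict(docs)
    for it in range(K):
        # Marginal benefit of each candidate: total weight of its not-yet-covered words.
        gains = {d: sum(benefit[v] for v in words - covered)
                 for d, words in candidates.items()}
        best = max(gains, key=gains.get)
        print(f"Iter {it + 1}: {gains} -> pick {best}")
        selected.append(best)
        covered |= candidates.pop(best)
    return selected

greedy_weighted_coverage(docs, benefit, K=2)
# Iter 1: {'D1': 12, 'D2': 11, 'D3': 10} -> pick D1
# Iter 2: {'D2': 2, 'D3': 3} -> pick D3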
13. Related Work Comparison
- Essential Pages (Swaminathan et al., 2008)
  - Uses a fixed function of word benefit
  - Depends on word frequency in the candidate set
- Our method directly learns the word benefit function
  - Feature space based on word frequency
  - Optimizes for subtopic loss
- First learning method for optimizing subtopic loss (to our knowledge)
14. Linear Discriminant
- x = (x1, x2, ..., xn): candidate documents
- y: subset of x
- V(y): union of words from the documents in y
- Discriminant function (written out below)
  - φ(v,x): frequency features (e.g., word appears in at least 10%, 20%, etc. of the candidate documents)
  - Benefit of covering word v is then w^T φ(v,x)
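Written out, a discriminant consistent with these bullets sums the learned benefit of every distinct word covered by the subset; this mirrors the joint feature map form used by structural SVMs (the paper's exact normalization may differ slightly):

\[
F(\mathbf{x},\mathbf{y};\mathbf{w}) \;=\; \mathbf{w}^{\top}\Psi(\mathbf{x},\mathbf{y}),
\qquad
\Psi(\mathbf{x},\mathbf{y}) \;=\; \sum_{v \in V(\mathbf{y})} \phi(v,\mathbf{x}).
\]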
15. Linear Discriminant
- Does NOT reward redundancy
  - Benefit of each word is only counted once
- Greedy prediction has a (1 - 1/e)-approximation bound
- Linear (joint feature space)
  - Allows for SVM optimization
16. More Sophisticated Discriminant
- Documents cover words to different degrees
  - A document with 5 copies of "Thorsten" might cover it better than another document with only 2 copies
- Use multiple word sets V1(y), V2(y), ..., VL(y)
  - Each Vi(y) contains only words satisfying certain importance criteria
17. More Sophisticated Discriminant
- Separate φ_i for each importance level i
- Joint feature map Ψ is the concatenation of the per-level feature maps Ψ_i (see below)
- Greedy prediction still has a (1 - 1/e)-approximation bound
- Still uses a linear feature space
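A sketch of the level-wise form implied by these bullets, assuming L importance levels (how each V_i is thresholded is a modeling choice):

\[
\Psi(\mathbf{x},\mathbf{y}) = \big(\Psi_1(\mathbf{x},\mathbf{y}),\ldots,\Psi_L(\mathbf{x},\mathbf{y})\big),
\qquad
\Psi_i(\mathbf{x},\mathbf{y}) = \sum_{v \in V_i(\mathbf{y})} \phi_i(v,\mathbf{x}).
\]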
18. Conventional SVMs
- Input: x (high-dimensional point)
- Target: y (either +1 or -1)
- Prediction: sign(w^T x)
- Training: minimize a regularized objective subject to margin constraints (shown below)
  - The sum of slacks upper bounds the accuracy loss
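For reference, one standard form of the soft-margin training problem referenced above; since ξ_i ≥ 1 whenever example i is misclassified, the sum of slacks upper bounds the training error:

\[
\min_{\mathbf{w},\,\boldsymbol{\xi} \ge 0}\;\; \tfrac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad
y_i\,\mathbf{w}^{\top}\mathbf{x}_i \;\ge\; 1 - \xi_i \quad \forall i.
\]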
19. Adapting to Predicting Subsets
- Input: x (candidate set of documents)
- Target: y (subset of x of size K)
- Same objective function
- One constraint for each incorrect labeling y'
  - Score of the correct labeling must be at least as large as that of each incorrect y', plus its loss (shown below)
  - Weighted subtopic loss (defined later)
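With joint feature map Ψ and loss Δ as above, this is the margin-rescaling structural SVM of Tsochantaridis et al., 2005:

\[
\min_{\mathbf{w},\,\boldsymbol{\xi} \ge 0}\;\; \tfrac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad
\mathbf{w}^{\top}\Psi(\mathbf{x}^{(i)},\mathbf{y}^{(i)}) \;\ge\; \mathbf{w}^{\top}\Psi(\mathbf{x}^{(i)},\mathbf{y}) + \Delta(\mathbf{y}^{(i)},\mathbf{y}) - \xi_i
\quad \forall i,\ \forall \mathbf{y} \ne \mathbf{y}^{(i)}.
\]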
20. Illustrative Example
- Original SVM problem
  - Exponentially many constraints
  - Most are dominated by a small set of important constraints
- Structural SVM approach
  - Repeatedly find the next most violated constraint, until the set of constraints is a good approximation (see the sketch below)
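A minimal Python sketch of this cutting-plane idea; joint_feature, loss, find_most_violated, and solve_qp_over_working_set are hypothetical placeholders standing in for the feature map Ψ, the loss Δ, loss-augmented inference, and the QP solver:

# Cutting-plane training loop (sketch, one slack per example).
# Hypothetical helpers (not defined here):
#   joint_feature(x, y)             -> Psi(x, y) as a numpy vector
#   loss(y_true, y)                 -> Delta(y_true, y)
#   find_most_violated(w, x, y)     -> argmax_y' [Delta(y, y') + w . Psi(x, y')]
#   solve_qp_over_working_set(W, C) -> (w, xi) optimized over constraints in W
import numpy as np

def cutting_plane_train(data, dim, C=1.0, eps=1e-3):
    w = np.zeros(dim)
    xi = {i: 0.0 for i in range(len(data))}       # one slack per example
    working_set = []                               # [(i, delta_psi, delta_loss)]
    while True:
        new_constraints = 0
        for i, (x, y_true) in enumerate(data):
            y_hat = find_most_violated(w, x, y_true)
            delta_psi = joint_feature(x, y_true) - joint_feature(x, y_hat)
            # Constraint: w . delta_psi >= loss(y_true, y_hat) - xi_i
            if loss(y_true, y_hat) - w.dot(delta_psi) > xi[i] + eps:
                working_set.append((i, delta_psi, loss(y_true, y_hat)))
                new_constraints += 1
        if new_constraints == 0:
            break                                  # working set approximates the full problem
        w, xi = solve_qp_over_working_set(working_set, C)
    return w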
24. Weighted Subtopic Loss
- Example
  - x1 covers t1
  - x2 covers t1, t2, t3
  - x3 covers t1, t3
- Motivation
  - Higher penalty for not covering popular subtopics
  - Mitigates effects of label noise in tail subtopics

  Subtopic   Docs   Loss
  t1         3      1/2
  t2         1      1/6
  t3         2      1/3
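One weighting consistent with the table above: each subtopic is weighted by its share of the document-subtopic coverage pairs, and the loss of a predicted subset is the total weight of the subtopics it fails to cover:

\[
w(t) \;=\; \frac{\#\{\text{docs covering } t\}}{\sum_{t'} \#\{\text{docs covering } t'\}},
\qquad
\Delta(\mathbf{y}^{(i)},\mathbf{y}) \;=\; \sum_{t \,\text{not covered by}\, \mathbf{y}} w(t).
\]

Here w(t1) = 3/6 = 1/2, w(t2) = 1/6, and w(t3) = 2/6 = 1/3, matching the table.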
25. TREC Experiments
- TREC 6-8 Interactive Track queries
  - Documents labeled into subtopics
  - 17 queries used
  - Considered only relevant docs
    - Decouples the relevance problem from the diversity problem
  - 45 docs/query, 20 subtopics/query, 300 words/doc
26. TREC Experiments
- 12/4/1 train/validation/test split
  - Approx. 500 documents in the training set
  - Permuted until all 17 queries were tested once
- Set K = 5 (some queries have very few documents)
- SVM-div uses term-frequency thresholds to define importance levels
- SVM-div2 additionally uses TFIDF thresholds
27. TREC Results

  Method             Loss
  Random             0.469
  Okapi              0.472
  Unweighted Model   0.471
  Essential Pages    0.434
  SVM-div            0.349
  SVM-div2           0.382

  Comparison                 W / T / L
  SVM-div vs. Ess. Pages     14 / 0 / 3
  SVM-div2 vs. Ess. Pages    13 / 0 / 4
  SVM-div vs. SVM-div2        9 / 6 / 2
28. Can expect further benefit from having more training data
29. Consistently outperforms Essential Pages
30. Summary
- Formulated diversified retrieval as predicting diverse subsets
- Efficient training and prediction algorithms
- Used weighted word coverage as a proxy for information coverage
- Encode diversity criteria using the loss function
  - Weighted subtopic loss
- http://projects.yisongyue.com/svmdiv/
31. Extra Slides
34. Essential Pages (Swaminathan et al., 2008)
- x = (x1, x2, ..., xn): set of candidate documents for a query
- y: a subset of x of size K (our prediction)
- Score combines:
  - the benefit of covering word v with document x_i
  - the importance of covering word v
- Intuition
  - Frequent words cannot encode information diversity
  - Infrequent words do not provide significant information
37. Finding Most Violated Constraint
- A constraint is violated when the loss-augmented score of an incorrect labeling exceeds the score of the correct labeling by more than the slack
- Finding the most violated constraint reduces to a loss-augmented prediction problem (written out below)
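In the notation of the earlier slides, the constraint for example i and labeling y is violated when

\[
\Delta(\mathbf{y}^{(i)},\mathbf{y}) + \mathbf{w}^{\top}\Psi(\mathbf{x}^{(i)},\mathbf{y}) - \mathbf{w}^{\top}\Psi(\mathbf{x}^{(i)},\mathbf{y}^{(i)}) \;>\; \xi_i,
\]

so the most violated constraint is found by loss-augmented prediction:

\[
\hat{\mathbf{y}} \;=\; \operatorname*{argmax}_{\mathbf{y}}\; \Big[\, \Delta(\mathbf{y}^{(i)},\mathbf{y}) + \mathbf{w}^{\top}\Psi(\mathbf{x}^{(i)},\mathbf{y}) \,\Big].
\]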
38. Finding Most Violated Constraint
- Encode each subtopic as an additional word to be covered
- Use greedy prediction to find an approximately most violated constraint
39. Approximate Constraint Generation
- Theoretical guarantees no longer hold
  - Might not find an epsilon-close approximation to the feasible region boundary
- Performs well in practice

40. Approximate constraint generation seems to perform well
42. Synthetic Dataset
- TREC dataset is very small
- Synthetic dataset lets us vary the retrieval size K
- 100 queries
  - 100 docs/query, 25 subtopics/query, 300 words/doc
- 15/10/75 train/valid/test split