Title: A Review of Information Filtering Part I: Adaptive Filtering
1. A Review of Information Filtering, Part I: Adaptive Filtering
Chengxiang Zhai
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
2. Outline
- The Problem of Adaptive Information Filtering (AIF)
- The TREC Work on AIF
  - Evaluation Setup
  - Main Approaches
  - Sample Results
- The Importance of Learning
- Summary and Research Directions
3. Adaptive Information Filtering (AIF)
- Dynamic information stream
- (Relatively) stable user interest
- System blocks non-relevant information according to the user's interest
- User provides feedback on the received items
- System learns from the user's feedback
- Performance measured by the utility of the filtering decisions
4. A Typical AIF Application: News Filtering
- Given a news stream and users
- Each user expresses an interest by a text query
- For each news article, the system makes a yes/no filtering decision for each user interest
- User provides feedback on the received news
- System learns from feedback
- Utility = 3 * (#good) - 2 * (#bad)
5. AIF vs. Retrieval, Categorization, Topic Tracking, etc.
- AIF is like retrieval over a dynamic stream of information items, but ranking is impossible
- AIF is like online binary categorization without initial training data and with limited feedback
- AIF is like tracking a user interest over a news stream
6. Evaluation of AIF
- Primary measure: linear utility (equivalent to a probability-of-relevance cutoff; see the sketch after this list)
  - E.g., LF1 = 3*R+ - 2*N+, used in TREC7 and TREC8; T9U = 2*R+ - N+, used in TREC9
- Problems with the linear utility:
  - Unbounded
  - Not comparable across topics/profiles
  - Average utility may be dominated by one topic
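To make the measure concrete, here is a minimal Python sketch of the linear utility; the coefficients follow the common TREC settings (LF1-style 3/-2 credit/penalty), while the function name and data layout are our own illustration.

```python
def linear_utility(delivered_labels, credit=3.0, penalty=2.0):
    """Compute credit * R+ - penalty * N+ over delivered documents.

    delivered_labels: one boolean per *delivered* document, True if the
    user judged it relevant. LF1 uses credit=3, penalty=2; T9U would use
    credit=2, penalty=1.
    """
    r_plus = sum(1 for rel in delivered_labels if rel)
    n_plus = len(delivered_labels) - r_plus
    return credit * r_plus - penalty * n_plus

# Example: 4 relevant and 3 non-relevant deliveries: 3*4 - 2*3 = 6.
print(linear_utility([True, False, True, True, False, False, True]))
```

Note that the measure is unbounded below: one profile with a long run of bad deliveries can dominate any average, which motivates the normalized measures on the next slide.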
7. Other Measures
- Nonlinear utility (e.g., an early relevant doc is worth more)
- Normalized utility
  - More meaningful for averaging
  - But can be inversely correlated with precision/recall!
- Other measures that reflect a trade-off between precision and recall
8. A Typical AIF System
[System diagram: a Doc Source feeds a Binary Classifier; the user profile text initializes the User Interest Profile; accepted docs go to the User, whose feedback, scored by a utility function, updates the profile.]
9. Three Basic Problems in AIF
- Making filtering decisions (binary classifier)
  - Doc text, profile text → yes/no
- Initialization
  - Initialize the filter based on only the profile text or very few examples
- Learning from:
  - Limited relevance judgments (only on "yes" docs)
  - Accumulated documents
- All trying to maximize the utility
10. The TREC Work on AIF
- The Filtering Track of TREC
- Major Approaches to AIF
- Sample Results
11. The Filtering Track (TREC7, 8, 9) (Hull 99; Hull & Robertson 00; Robertson & Hull 01)
- Encourage development and evaluation of techniques for text filtering
- Tasks:
  - Adaptive filtering (start with little/no training; online filtering with limited feedback)
  - Batch filtering (start with many training examples; online filtering with limited feedback)
  - Routing (start with many training examples; ranking test documents)
12. AIF Evaluation Setup
- TREC7: LF1, LF3 utility functions
  - AP88-90; 50 topics
  - No training initially
- TREC8: LF1, LF2 utility functions
  - Financial Times 92-94; 50 topics
  - No training initially
- TREC9: T9U, Precision@50, etc.
  - OHSUMED; 63 original topics + 4903 MeSH topics
  - 2 initial (positive) training examples available
13. Major Approaches to AIF
- Extended retrieval systems
  - Reuse retrieval techniques to score documents
  - Use a score threshold for the filtering decision
  - Learn to improve scoring with traditional feedback
  - New approaches to threshold setting and learning
- Modified categorization systems
  - Adapt to binary, unbalanced categorization
  - New approaches to initialization
  - Train with censored training examples
14. A General Vector-Space AIF Approach
[Flow diagram: a doc vector is scored against the profile vector; Thresholding compares the score to a threshold for a yes/no decision; feedback information drives Vector Learning (profile updates), Threshold Learning, and Utility Evaluation. A filtering-loop sketch follows.]
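The diagram can be read as a simple control loop. Below is a minimal sketch of that loop in Python, assuming bag-of-words vectors, dot-product scoring, and a Rocchio-style profile update; all names and constants are illustrative, not taken from any particular TREC system.

```python
from collections import Counter

def score(profile, doc):
    # Dot product between the profile vector and the document vector.
    return sum(profile[w] * c for w, c in doc.items())

def run_filter(profile_text, doc_stream, threshold, get_feedback,
               learn_rate=0.1):
    profile = Counter(profile_text.split())   # initialize from profile text
    utility = 0.0
    for doc_text in doc_stream:
        doc = Counter(doc_text.split())
        if score(profile, doc) < threshold:
            continue                           # blocked: no feedback available
        relevant = get_feedback(doc_text)      # judgments only on "yes" docs
        utility += 3.0 if relevant else -2.0   # LF1-style utility
        sign = 1.0 if relevant else -1.0
        for w, c in doc.items():               # Rocchio-style vector learning
            profile[w] += sign * learn_rate * c
        # Threshold learning (the focus of the next slides) would go here.
    return utility, profile
```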
15. Extended Retrieval Systems
- City Univ. / Microsoft (Okapi): Prob. IR
- Univ. of Massachusetts (Inquery): Infer. Net.
- Queens College, CUNY (Pirc): Prob. IR
- Clairvoyance Corp. (Clarit): Vector Space
- Univ. of Nijmegen (KUN): Vector Space
- Univ. of Twente (TNO): Language Model
- And many others ...
16. Threshold Setting in Extended Retrieval Systems
- Utility-independent approaches (generally do not work well; not covered in this talk)
- Indirect (linear) utility optimization
  - Logistic regression (score → prob. of relevance)
- Direct utility optimization
  - Empirical utility optimization
  - Expected utility optimization given score distributions
- All try to learn the optimal threshold
17. Difficulties in Threshold Learning
- Censored data
- Little/no labeled data
- Scoring bias due to vector learning
[Illustration: judged scores above the threshold θ = 30.0 (36.5 R, 33.4 N, 32.1 R); scores below it (29.9, 27.3, ...) are unjudged (?).]
18. Logistic Regression
- General idea: convert the score of D to p(R|D) (see the sketch after this list)
- Fit the model using feedback data
- Linear utility is optimized with a fixed prob. cutoff
- But:
  - Possibly incorrect parametric assumptions
  - No positive examples initially
  - Censored data and limited positive feedback
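As an illustration of this indirect approach, here is a small sketch that fits p(R|s) = 1/(1 + exp(-(a + b*s))) by gradient ascent on the log-likelihood and then derives the score threshold from a linear-utility probability cutoff; the tiny data set and learning-rate settings are made up for the example.

```python
import math

def fit_logistic(scores, labels, a=0.0, b=0.0, lr=0.01, iters=2000):
    """Fit p(R|s) = sigmoid(a + b*s) to (score, relevant) feedback pairs."""
    for _ in range(iters):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a + b * s)))
            ga += (y - p)        # d logL / da
            gb += (y - p) * s    # d logL / db
        a += lr * ga / len(scores)
        b += lr * gb / len(scores)
    return a, b

# For LF1 = 3*R+ - 2*N+, delivering is worthwhile iff 3p - 2(1-p) > 0,
# i.e., p > 0.4, so the threshold is the score where p(R|s) = 0.4.
a, b = fit_logistic([36.5, 33.4, 32.1, 30.5], [1, 0, 1, 1])
theta = (math.log(0.4 / 0.6) - a) / b   # assumes the fitted slope b > 0
```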
19. Logistic Regression in Okapi (Robertson & Walker 2000)
- Motivation: recover the probability of relevance from the original prob. IR model
- Need to estimate α, β, and ast1 (avg. score of top 1 docs)
- All topics share the same β, which is initially set and never updated
20. Logistic Regression in Okapi (cont.)
- Initially, all topics share the same α and β, and ast1 is estimated with a linear regression: ast1 = a1 + a2 * maxscore
- After one week, ast1 is estimated based on the documents available from the week
- Threshold learning:
  - β is fixed all the time
  - α is updated with gradient descent
  - A heuristic ladder is used to allow exploration
21. Logistic Regression in Okapi (cont.)
- Pros:
  - Well-motivated method for the Okapi system
  - Based on a principled approach
- Cons:
  - Limited adaptation
  - Exploration is ad hoc (over-explores initially)
  - Some nonlinear utilities may not correspond to a fixed probability cutoff
22. Direct Utility Optimization
- Given:
  - A utility function U(C_R+, C_R-, C_N+, C_N-)
  - Training data D = {<s_i, r_i>}, r_i ∈ {R, N, ?}
- Formulate the utility as a function of the threshold and the training data: U = F(θ, D)
- Choose the threshold by optimizing F(θ, D), i.e., θ* = argmax_θ F(θ, D)
23. Empirical Utility Optimization
- Basic idea (see the sketch after this list):
  - Compute the utility on the training data for each candidate threshold (score of a training doc)
  - Choose the threshold that gives the maximum utility
- Difficulty: biased training sample!
  - We can only get an upper bound for the true optimal threshold
- Solutions:
  - Heuristic adjustment (lowering) of the threshold
  - Leads to beta-gamma threshold learning
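A brute-force sketch of the idea, assuming an LF1-style utility; candidate thresholds are the scores of the judged documents, and "deliver nothing" (utility 0) is kept as the fallback.

```python
def empirical_optimal_threshold(scores, labels, credit=3.0, penalty=2.0):
    """scores/labels: judged training docs (biased toward high scores)."""
    best_theta, best_u = float("inf"), 0.0   # deliver nothing: utility 0
    for theta in sorted(set(scores), reverse=True):
        u = sum(credit if y else -penalty
                for s, y in zip(scores, labels) if s >= theta)
        if u > best_u:
            best_theta, best_u = theta, u
    return best_theta, best_u

theta, u = empirical_optimal_threshold([36.5, 33.4, 32.1, 30.5], [1, 0, 1, 1])
```

Because judgments exist only above the old threshold, this estimate is an upper bound on the true optimal threshold; the beta-gamma method on the next slide lowers it to compensate.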
24. The Beta-Gamma Threshold Learning Method in CLARIT (Zhai et al. 00)
- Basic idea:
  - Extend the empirical utility optimization method by putting a lower bound on the threshold
  - β corrects the score bias
  - γ controls exploration
- β and γ are relatively stable and can be tuned on independent data
- Can optimize any utility function (with an appropriate zero-utility lower bound); a sketch of the interpolation follows
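A minimal sketch of one way to realize the beta-gamma interpolation; the exact functional form below (a weight alpha decaying exponentially in the number of judged examples) is our reading of the method, and the default beta/gamma values are arbitrary.

```python
import math

def beta_gamma_threshold(theta_zero, theta_opt, n_examples,
                         beta=0.1, gamma=0.05):
    """Interpolate between the zero-utility lower bound (theta_zero) and
    the empirically optimal threshold (theta_opt)."""
    alpha = beta + (1.0 - beta) * math.exp(-n_examples * gamma)
    return alpha * theta_zero + (1.0 - alpha) * theta_opt
```

Early on (small N) alpha is near 1, so the threshold stays near the conservative lower bound theta_zero (more exploration); as judged examples accumulate, it moves toward theta_opt (more exploitation).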
25. Illustration of Beta-Gamma Threshold Learning
26. Beta-Gamma Threshold Learning (cont.)
- Pros:
  - Explicitly addresses the exploration-exploitation tradeoff (safe exploration)
  - Arbitrary utility (with an appropriate lower bound)
  - Empirically effective and robust
- Cons:
  - Purely heuristic
  - The zero-utility lower bound is often too conservative
27. Score Distribution Approaches (Arampatzis & van Hameren 01; Zhang & Callan 01)
- Assume generative models of scores: p(s|R), p(s|N)
- Estimate the models with training data
- Find the threshold by optimizing the expected utility under the estimated models
- Specific methods differ in how they define and estimate the score distributions
28. A General Formulation of Score Distribution Approaches
- Given p(R), p(s|R), and p(s|N), the expected utility for sample size n is a function of θ and n, i.e., EU = F(n, θ)
- The optimal threshold for sample size n is θ*(n) = argmax_θ F(n, θ)
29. Solution for Linear Utility, Continuous p(s|R) and p(s|N)
- Linear utility: U = C_R+ * R+ + C_N+ * N+ (with C_R+ > 0 > C_N+)
- The optimal threshold is the solution θ of the following equation (independent of n): C_R+ * p(R) * p(θ|R) + C_N+ * p(N) * p(θ|N) = 0
30. Gaussian-Exponential Distributions
- p(s|R) ~ N(μ, σ²); p(s - s0|N) ~ Exp(λ)
(from Zhang & Callan 2001)
31. Optimal Threshold for Gaussian-Exp. Distributions (numerical sketch below)
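A numerical sketch under the Gaussian-Exponential model above: for a linear utility c_r*R+ - c_n*N+, the expected gain of delivering a document with score s is g(s) = c_r * p(R) * p(s|R) - c_n * p(N) * p(s|N), and the optimal threshold is where g crosses zero (found here by bisection). All parameter values are made up for illustration.

```python
import math

def gain(s, p_r, mu, sigma, lam, s0, c_r=2.0, c_n=1.0):
    """Expected utility gain of delivering a doc with score s."""
    p_s_r = (math.exp(-0.5 * ((s - mu) / sigma) ** 2)
             / (sigma * math.sqrt(2 * math.pi)))          # Gaussian p(s|R)
    p_s_n = lam * math.exp(-lam * (s - s0)) if s >= s0 else 0.0  # Exp p(s|N)
    return c_r * p_r * p_s_r - c_n * (1.0 - p_r) * p_s_n

def optimal_threshold(p_r, mu, sigma, lam, s0, lo, hi, tol=1e-6):
    """Bisection for gain(theta) = 0; requires gain(lo) < 0 < gain(hi),
    e.g., lo near the non-relevant mode s0 and hi near the relevant mean mu."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gain(mid, p_r, mu, sigma, lam, s0) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Illustrative parameters: 1% relevant, relevant scores ~ N(35, 3^2),
# non-relevant scores ~ 20 + Exp(0.5).
theta = optimal_threshold(p_r=0.01, mu=35.0, sigma=3.0, lam=0.5,
                          s0=20.0, lo=20.0, hi=35.0)
print(round(theta, 2))
```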
32. Parameter Estimation in KUN (Arampatzis & van Hameren 01)
- μ, σ² estimated using ML on relevant docs
- λ estimated using the top 50 non-relevant docs
- Some recent improvements:
  - Compute p(s) based on p(w_i)
  - Initial distributions estimated using the query as the only relevant doc
  - Soft probabilistic threshold: sampling with p(R|s)
33. Maximum Conditional Likelihood (Zhang & Callan 01)
- Explicit modeling of censored data
- Data: <s_i, r_i, δ_i>, r_i ∈ {R, N}, δ_i indicating whether the doc was delivered
- Maximize the conditional likelihood of the observed judgments (see the sketch after this list)
- Optimized with conjugate gradient descent
- A prior is introduced for smoothing (making it Bayesian?)
- A minimum delivery ratio is used to ensure exploration
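A sketch of the censored-data idea, our construction rather than the authors' code: assuming the Gaussian-Exponential score model from the preceding slides, it maximizes the likelihood of the judged (delivered) documents conditioned on delivery (s >= θ), using SciPy's conjugate-gradient optimizer; the smoothing prior and minimum delivery ratio are omitted.

```python
import numpy as np
from scipy import optimize, stats

def neg_cond_log_lik(params, scores, labels, theta, s0):
    """params = (logit_pi, mu, log_sigma, log_lam); labels: 1 = R, 0 = N."""
    logit_pi, mu, log_sigma, log_lam = params
    pi = 1.0 / (1.0 + np.exp(-logit_pi))          # p(R)
    sigma, lam = np.exp(log_sigma), np.exp(log_lam)
    # Probability that a doc is delivered at all: P(s >= theta).
    p_deliver = (pi * stats.norm.sf(theta, mu, sigma)
                 + (1 - pi) * stats.expon.sf(theta, loc=s0, scale=1.0 / lam))
    # Joint density of (score, label) for each delivered doc.
    dens = np.where(labels == 1,
                    pi * stats.norm.pdf(scores, mu, sigma),
                    (1 - pi) * stats.expon.pdf(scores, loc=s0, scale=1.0 / lam))
    return -np.sum(np.log(dens / p_deliver))

scores = np.array([36.5, 33.4, 32.1, 30.5])   # delivered docs only (censored)
labels = np.array([1, 0, 1, 1])
fit = optimize.minimize(neg_cond_log_lik, x0=np.array([-2.0, 35.0, 1.0, -1.0]),
                        args=(scores, labels, 30.0, 20.0), method="CG")
logit_pi, mu, log_sigma, log_lam = fit.x
```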
34. Score Distribution Approaches (cont.)
- Pros:
  - Principled approach
  - Arbitrary utility
  - Empirically effective
- Cons:
  - May be sensitive to the scoring function
  - Exploration not addressed
35. Modified Categorization Methods
- Mostly applied to batch filtering or routing, and sometimes combined with Rocchio:
  - K-Nearest Neighbor (CMU)
  - Naïve Bayes (Seoul)
  - Neural Network (ICDC, DSO, IRIT)
  - Decision Tree (NTT)
- Only K-Nearest Neighbor was applied to AIF (CMU)
  - With special thresholding strategies
36. The State-of-the-Art Performance
- For high-precision utilities, systems could hardly beat the zero-return baseline! (i.e., they got negative utility)
- Direct/indirect utility optimization methods generally performed much better than utility-independent tuning of the threshold
- It is hard to compare different threshold-learning methods, due to too many other factors (e.g., scoring)
37. TREC7
- No initial examples
- No system beat the zero-return baseline for LF1 (prob. of relevance > 0.4)
- Several systems beat the zero-return baseline for LF3 (prob. of relevance > 0.2)
(from Hull 99)
38. TREC7
- The learning effect is clear in some systems
- But the stream is not long enough for systems to benefit from learning
(from Hull 99)
39. TREC8
- Again, the learning effect is clear
- But systems still couldn't beat the zero-return baseline!
(from Hull & Robertson 00)
40. TREC9
- 2 initial examples
- Amplified learning effect
- T9U (prob. of relevance > 0.33)
- Systems clearly beat the zero-return baseline!
(from Robertson & Hull 01)
41. The Importance of Learning in AIF (results from Zhai et al. 00)
- Learning and initial inaccuracies: learning compensates for initial inaccuracies
- Exploitation vs. exploration: exploration (lowering the threshold) pays off in the long run
[Plot: score vs. time for four threshold trajectories: ideal adaptive, ideal fixed, actual adaptive, actual fixed.]
42. Learning Effect 1: Correction of Inappropriate Initial Threshold Setting
[Plot: trajectories comparing a bad initial threshold without updating vs. with updating.]
43. Learning Effect 2: Early Exploration Pays Off
44. Learning Effect 3: Regular Exploration Pays Off Later
45. Tradeoff between Exploration and Exploitation
[Plot: trajectories contrasting an under-exploring and an over-exploring threshold policy.]
46. Summary
- AIF is a very interesting and challenging online learning problem
- As a learning task, it has extremely sparse training data:
  - Initially, no training data
  - Later, limited and censored training examples
- Practically, learning must also be efficient
47. Summary (cont.)
- Evaluation of AIF is challenging
- Good performance (utility) is achieved by:
  - Direct/indirect utility optimization
  - Learning the optimal score threshold from feedback
  - An appropriate tradeoff between exploration and exploitation
- Several different threshold-learning methods can all be effective
48. Research Directions
- Threshold learning
  - Non-parametric score density estimation?
  - Controlled comparison of threshold methods
- Integrated AIF model
  - Bayesian decision theory + EM?
- Exploration-exploitation tradeoff
  - Reinforcement learning?
- User models and evaluation measures
  - Users care about more factors than the linear utility
  - A user's interest may drift over time
  - Redundancy reduction / novelty detection
49. References
- General papers on TREC filtering evaluation:
  - D. Hull, The TREC-7 Filtering Track: Description and Analysis, TREC-7 Proceedings.
  - D. Hull and S. Robertson, The TREC-8 Filtering Track Final Report, TREC-8 Proceedings.
  - S. Robertson and D. Hull, The TREC-9 Filtering Track Final Report, TREC-9 Proceedings.
- Papers on specific adaptive filtering methods:
  - Stephen Robertson and Stephen Walker, Threshold Setting in Adaptive Filtering, Journal of Documentation, 56:312-331, 2000.
  - Chengxiang Zhai, Peter Jansen, and David A. Evans, Exploration of a Heuristic Approach to Threshold Learning in Adaptive Filtering, ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'00), 2000. Poster presentation.
  - Avi Arampatzis and Andre van Hameren, The Score-Distributional Threshold Optimization for Adaptive Binary Classification Tasks, SIGIR'01, 2001.
  - Yi Zhang and Jamie Callan, Maximum Likelihood Estimation for Filtering Thresholds, SIGIR'01, 2001.
50. The End. Thank you!