Title: Adaptive Information Retrieval
1. Adaptive Information Retrieval
- Eren Manavoglu (1)
- Advisor: Dr. C. Lee Giles (1,2)
- (1) Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802
- (2) School of Information Sciences and Technology, The Pennsylvania State University, University Park, PA 16802
2. Information Retrieval over the Web Today
- Main tools: search engines
- User input: query terms/questions
- Result presentation: ranked list of matching documents
- Ranking algorithm: a combination of topical relevance, citations, and document popularity (probably)
3. Why Do We Need Adaptive Information Retrieval?
- Different users use the same query to access different information.
  - Jaguar: animal or automobile?
- Even the same user may have different information needs at different points in time about the same topic.
  - Dick Cheney: Scooter scandal last month, hunting accident today.
4. Adaptive Information Retrieval Techniques
- Personalization
  - Server-side personalization: relies mainly on explicit user input (MyYahoo, Personalized Google); the parameters of the ranking algorithm are personalized
  - Client-side personalization: uses both implicit and explicit user feedback; the user model is usually in the form of interest profile(s)
- Recommender systems
  - Will be discussed in more detail later
- Clustering search results
  - Will be discussed in more detail later
A survey can be found in Manavoglu, Giles, Sping, Wang, "The Power of One - Gaining Business Value from Personalization Technologies", Chapter 9.
5. What this Proposal is About
- Learn user behavior models to capture individual differences
- Group the documents at query time to capture the current state of available information sources
- Use the user model to highlight or recommend groups of documents of interest to the user
6. Work Done So Far
- Recommender Systems
- User Behavior Modeling
- Clustering Search Results
7. Recommender Systems
- Goal: to automate word-of-mouth recommendations within a community.
- Approaches
  - Content-based filtering: recommendations are based solely on the content of the items (e.g. topic, author, etc.)
  - Collaborative filtering: similar users are identified based on the items they view, and the items viewed by like-minded users are then recommended to a given user [Resnick 94]
  - Hybrid: both the content of the items and the similarity of the users are used in making recommendations [Shani 2005]
8. Recommender Systems - A Probabilistic Approach
- Represent the data as a collection of ordered sequences of document requests
- Assume that we are given a dataset consisting of ordered sequences over some fixed alphabet; the elements of the alphabet are documents
- For each document, define its history H as the so-far observed subsequence of the current sequence
- Our goal is to predict the next document Dnext given the history H
- Learn a probabilistic model P(Dnext | H, Data)
9. Recommender Systems - Probabilistic Models
- A mixture-model-based approach could be taken for learning the probabilistic model P(Dnext | H, Data); a sketch of the mixture form is given below
- Nc is the number of components, and P(Dnext | H, Data, k) is the component model
- The component distribution P(Dnext | H, Data, k) should make use of the sequence information, and the learning should be scalable
- Examples: mixture of multinomials, mixture of Markov models
10. Proposed Recommendation Model - Maximum Entropy (maxent)
- Maximum entropy provides a framework for combining information from different knowledge sources.
- Each knowledge source imposes a set of constraints.
- The intersection of all constraints contains a set of probability functions.
- The maximum entropy principle chooses the one with the highest entropy, i.e. the flattest function.
- Advantage of the maxent approach
  - Combines 1st-order Markov model features and long-term dependencies among the data.
11. Representation for Long-Term Dependence - Triggers
- The long-term dependence of Dnext on H is represented with triggers.
- The pair of actions (a, b) is a trigger iff P(Dnext = b | a in H) is substantially different from P(Dnext = b).
- Triggers are pairs with high mutual information scores.
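The slide describes trigger mining only in words; the sketch below is one plausible way to do it, assuming sessions are lists of document/action IDs. The function name find_triggers and the pointwise-MI weighting are illustrative choices, not the exact procedure used in the thesis.

```python
from collections import Counter
from math import log

def find_triggers(sessions, top_n=1000):
    """Score pairs (a, b) by a mutual-information-style association between
    'a appeared earlier in the session' and 'b was requested next'.
    Returns the top_n highest-scoring pairs as candidate triggers."""
    pair_counts = Counter()   # (a, b): a occurred somewhere before b in a session
    item_counts = Counter()   # marginal item counts
    total_items = 0
    total_pairs = 0
    for session in sessions:
        for i, b in enumerate(session):
            item_counts[b] += 1
            total_items += 1
            for a in set(session[:i]):          # each distinct earlier item
                pair_counts[(a, b)] += 1
                total_pairs += 1
    scores = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / total_pairs
        p_a = item_counts[a] / total_items
        p_b = item_counts[b] / total_items
        scores[(a, b)] = p_ab * log(p_ab / (p_a * p_b))  # joint-weighted pointwise MI
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```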
12. Maximum Entropy Model Definition
- The maximum entropy objective function leads to the model sketched below
- Fs(D,H) are the feature indicator functions; S is the total number of bigrams/triggers for document D
- The lambda_s are the maxent model parameters for component k
- If s corresponds to the bigram (a, b), then Fs(D,H) = 1 iff D = a and Dprev = b, and Fs(D,H) = 0 otherwise
- Zlambda,k(H) is a normalization constant
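The model equation referred to above was lost in the export; the standard conditional maxent (log-linear) form consistent with the notation on this slide would be:

```latex
P(D \mid H, \text{Data}, k)
  = \frac{1}{Z_{\lambda,k}(H)}
    \exp\!\Bigg( \sum_{s=1}^{S} \lambda_{s,k}\, F_s(D, H) \Bigg),
\qquad
Z_{\lambda,k}(H) = \sum_{D'} \exp\!\Bigg( \sum_{s=1}^{S} \lambda_{s,k}\, F_s(D', H) \Bigg)
```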
13. Maximum Entropy Model - A Complication and its Solution
- Maxent computation is expensive if the number of items is large: on a dataset of 500K items it could take months
- This problem can be solved if the items are clustered before applying the maxent model
- We employed a user-navigation-based clustering algorithm built on a simple idea: documents that are requested consecutively are related
  - Compute the document bigrams
  - Apply greedy clustering using the bigram counts (a sketch follows the reference below)
Details of the algorithm can be found in Pavlov, Manavoglu, Giles, Pennock, 2004.
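The published algorithm is in the cited paper; the sketch below illustrates only the simple idea stated on the slide (consecutive requests are related) via a greedy merge over bigram counts. The cluster-size cap and the function name are illustrative assumptions, not the published procedure.

```python
from collections import Counter, defaultdict

def greedy_bigram_clusters(sessions, max_cluster_size=500):
    """Greedy navigation-based clustering: documents requested consecutively
    (bigrams) are pulled into the same cluster, processing the most frequent
    bigrams first. Illustrative variant, not the exact published algorithm."""
    bigrams = Counter()
    for s in sessions:
        for a, b in zip(s, s[1:]):
            bigrams[(a, b)] += 1

    cluster_of = {}              # document -> cluster id
    members = defaultdict(set)   # cluster id -> set of documents
    next_id = 0
    for (a, b), _count in bigrams.most_common():
        ca, cb = cluster_of.get(a), cluster_of.get(b)
        if ca is None and cb is None:
            cluster_of[a] = cluster_of[b] = next_id
            members[next_id] = {a, b}
            next_id += 1
        elif ca is None and len(members[cb]) < max_cluster_size:
            cluster_of[a] = cb
            members[cb].add(a)
        elif cb is None and ca is not None and len(members[ca]) < max_cluster_size:
            cluster_of[b] = ca
            members[ca].add(b)
        # otherwise both documents are already placed (or the target is full)
    return cluster_of
```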
14. Experimental Evaluation
- Competing models
  - Maxent
  - 1st-order Markov model
  - Mixture of Markov models (30 components)
  - Mixture of multinomials (60 components, with the similarity-based CiteSeer predictors as its feature set)
  - Correlation
  - Individual CiteSeer similarity-based predictors
  - CiteSeer merged-similarity-based predictor
- We call "ResearchIndex Merge" the recommender obtained by pulling all similarity-based recommenders together; essentially, this corresponds to the currently available recommending system in CiteSeer.
15. Current Recommendation Algorithms in CiteSeer
- Similarity recommenders
  - Content-based: text or citation similarity
    - Sentence Similarity, Text Similarity, Active Bibliography, Cited By, Co-citations, Cites
  - Collaborative: user similarity
    - Users Who Viewed
  - Source similarity
    - On Same Site
For detailed information on CiteSeer recommenders please see Lawrence, Giles, Bollacker, 1999.
16. Experimental Setup
- Conducted on 6 months' worth of CiteSeer data in the form of <time, user, request>
- Chronologically partitioned the data into 5 million training and 1.7 million test requests
- Document requests were aggregated by userID and sessionized; an inactivity time of 300 seconds was used to break the requests into sessions (a sessionization sketch follows this list)
- Robots were identified and eliminated based on the number of requests submitted
- Documents that appeared fewer than 3 times were discarded (250,000 documents remained)
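A minimal sessionization sketch, assuming requests are (time, user, document) tuples with UNIX timestamps; the field names and tuple layout are illustrative, not the actual log schema.

```python
SESSION_GAP = 300  # seconds of inactivity that ends a session

def sessionize(requests, gap=SESSION_GAP):
    """Group (time, user, doc) tuples into per-user sessions, splitting
    whenever the gap between consecutive requests exceeds `gap` seconds."""
    sessions = []
    requests = sorted(requests, key=lambda r: (r[1], r[0]))  # by user, then time
    current, last_user, last_time = [], None, None
    for time, user, doc in requests:
        new_session = (user != last_user) or (last_time is not None and time - last_time > gap)
        if new_session and current:
            sessions.append((last_user, current))
            current = []
        current.append(doc)
        last_user, last_time = user, time
    if current:
        sessions.append((last_user, current))
    return sessions
```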
17. Evaluation Metrics
- Hit ratio: the ratio of correct predictions to the total number of predictions made
- Average height of prediction: predictions form a list of actions ranked by their probability values; the average height of prediction is the average position of the hit within this list
- The first N predictions in the list are considered and performance is evaluated based on the success of these N predictions, for N = 1, ..., 5, 10 (called bins hereafter); a small sketch of both metrics follows
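A small sketch of how these two metrics can be computed over ranked prediction lists; the exact bin handling and tie-breaking in the thesis experiments may differ.

```python
def evaluate(ranked_predictions, actual, bins=(1, 2, 3, 4, 5, 10)):
    """Hit ratio and average height for top-N prediction lists.
    `ranked_predictions[i]` is the model's ranked list for test case i,
    `actual[i]` is the item the user actually requested next."""
    results = {}
    for n in bins:
        hits, heights = 0, []
        for preds, truth in zip(ranked_predictions, actual):
            top_n = preds[:n]
            if truth in top_n:
                hits += 1
                heights.append(top_n.index(truth) + 1)  # 1 = top of the list
        results[n] = {
            "hit_ratio": hits / len(actual),
            "avg_height": sum(heights) / len(heights) if heights else None,
        }
    return results
```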
18. Results
- Among the individual CiteSeer recommenders, Active Bibliography was the best predictor. However, the merged predictor performed better than Active Bibliography; hence it represents the CiteSeer recommender in the remainder of the experiments.
19. Results
Although the merged CiteSeer predictor does better for longer recommendation lists, maxent is the best for a small number of guesses. Correlation improves as the recommendation list gets longer. Also presented are the average-height results of the competing models; with respect to average height, maxent outperformed all the other recommenders.
20. Conclusions - Discussion
- Maxent proved to be one of the best models, especially when the number of recommendations to be made was limited
- However, clustering the documents is crucial for its feasibility
- Most experiments in the field of recommender systems have been offline experiments. This could be misleading: the effect of the recommendation on the user should also be taken into account. We plan to set up a medium for online experiments.
21. User Behavior Models
22. Motivation
- In the previous section we focused only on the items users viewed. But users do much more than just view items: they browse, query, look at recommendations, etc.
- Can we improve the user experience by looking at these other actions as well?
- Can we detect what users are trying to do in a given session?
23. User Behavior Modeling
- What data do we have about users?
- Web logs, in clickstream format
- How can we make use of this data?
- By applying web usage mining techniques
24. Previous Work on Web Usage Mining
- Frequent item set identification by association rule extraction methods [1]
- Collaborative filtering for recommender systems [2,3]
- Probabilistic graphical models (e.g. Bayesian nets) to discover and represent dependencies among different variables [4]
- Sequential pattern analysis to discover patterns in time-ordered sessions [5]

[1] Liu et al., Integrating Classification and Association Rule Mining. Fourth International Conference on Knowledge Discovery and Data Mining, 1998.
[2] Resnick et al., GroupLens: An Open Architecture for Collaborative Filtering of Netnews. ACM 1994 Conference on Computer Supported Cooperative Work.
[3] Pavlov, Manavoglu, Pennock, Giles, Collaborative Filtering with Maximum Entropy. IEEE Intelligent Systems, 2004.
[4] D. Heckerman, Bayesian Networks for Data Mining. Data Mining and Knowledge Discovery, 1997.
[5] Cadez et al., Predictive Profiles for Transaction Data Using Finite Mixture Models. TR, UC Irvine, 2001.
25. Problem Definition
- We propose the following sequential framework for modeling user behavior.
- Each user session is represented as a sequence, labeled with a user ID.
- Each individual item in the sequence represents a user action.
- For each action, the history H(U) is defined as the so-far observed ordered sequence of actions.
- P(Anext | H(U), Data) is the behavior model for user U; it predicts the next action Anext given the history H(U).
Example (one session unfolding over time):
- User U, time t:   Action At = Query,    H(U) = empty
- User U, time t+1: Action At+1 = Browse, H(U) = At
- User U, time t+2: Action At+2 = Help,   H(U) = At, At+1
26. Problem Definition
- Problem
  - To infer P(Anext | H(U), Data) for each individual, given the training data.
- Proposed method
  - Mixture models
27. Why Mixture Models?
- Mixture models can handle heterogeneous data and provide a clustering framework.
- Mixture models are weighted combinations of component models.
- Each component represents a dominant pattern in the data.
- A global mixture model captures the general patterns among users.
- A mixture model can be personalized to capture individual behavior patterns.
Example (diagram): Customer A, who frequently visits but rarely buys, is modeled as 0.2 x Buyer Model + 0.8 x Browser Model.
McLachlan and Basford 1988; Cadez 2001
28. Mixture Models
- Global mixture models
  - Parameters are the same for all individuals: an Nc-component mixture model (sketched after the next slide)
29. Personalized Mixture Models
- Learn individual component probabilities alphaU,k for each user; the resulting model is therefore specific to each user (sketched below).
- For the component distribution P(Anext | H(U), Data, k), Markov and maximum entropy distributions are used.
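The mixture equations on these two slides did not survive the export; the standard forms implied by the notation (global weights alpha_k versus per-user weights alpha_{U,k}) would be:

```latex
\text{Global:}\quad
P(A_{\text{next}} \mid H(U), \text{Data})
  = \sum_{k=1}^{N_c} \alpha_k \, P(A_{\text{next}} \mid H(U), \text{Data}, k)

\text{Personalized:}\quad
P(A_{\text{next}} \mid H(U), \text{Data})
  = \sum_{k=1}^{N_c} \alpha_{U,k} \, P(A_{\text{next}} \mid H(U), \text{Data}, k),
\qquad \sum_{k} \alpha_{U,k} = 1
```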
30. Parameter Estimation
- Learn the parameters of the global model by the expectation-maximization (EM) algorithm.
- Fix the component distributions.
- Learn the individual component weights alphaU,k for known users by EM.
- For users who do not appear in the training data, the alphaU,k values default to the global alphak values.
31. Visualization
- Each component of a mixture model can be viewed as a cluster.
- Each session is a weighted combination of these clusters.
- The probability of user session SU belonging to cluster k is P(k | SU, Data), sketched after this list.
[Figure: sample cluster]
- Each session is assigned to the cluster that maximizes P(k | SU, Data).
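The cluster-membership formula itself was lost in the export; under the mixtures above, the standard Bayes form, which also serves as the E-step responsibility when fitting the alpha_{U,k} weights by EM on the previous slide, would be:

```latex
P(k \mid S_U, \text{Data})
  = \frac{\alpha_{U,k}\, P(S_U \mid \text{Data}, k)}
         {\sum_{k'=1}^{N_c} \alpha_{U,k'}\, P(S_U \mid \text{Data}, k')}
```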
32. Visualization - Graphs
- If the number of actions is large, viewing and interpreting sample sessions becomes impossible.
- Build a graph for each cluster (a construction sketch follows this list):
  - Triggers and bigrams are computed and sorted for each cluster
  - Pairs of actions whose scores exceed a threshold value are used to build the graph
  - The sequence of actions defines the direction of the edges
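A minimal sketch of the graph construction for one cluster, using bigram counts only and a raw count threshold; the thesis graphs also use trigger scores, and the threshold value here is an illustrative placeholder.

```python
from collections import Counter

def cluster_graph(cluster_sessions, threshold=5):
    """Build a directed action graph for one behavior cluster:
    nodes are actions, and an edge a -> b is added when the bigram
    (a, b) occurs more than `threshold` times in the cluster's sessions."""
    bigrams = Counter()
    for session in cluster_sessions:
        for a, b in zip(session, session[1:]):
            bigrams[(a, b)] += 1
    edges = {(a, b): c for (a, b), c in bigrams.items() if c > threshold}
    nodes = {n for edge in edges for n in edge}
    return nodes, edges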
33. Experimental Setup
- CiteSeer log files covering approximately 2 months.
- Log files are series of transactions in the form of <time, action, user ID, related info>.
  - <1014151912.455 event authorhomepage 3491213 197248 - - ip 42779 agent Mozilla/2.0 (compatible; MSIE 3.02; Update a; Windows NT)>
- Returning users are identified by cookies.
- Actions are broken into sessions based on a time heuristic (300 seconds of inactivity indicates the end of a session).
  - User1 Session1: 37 31 34 31 32
  - User1 Session2: 37 31 34 31 34 33 34
  - User2 Session1: 31 34 33 20
  - User2 Session2: 33 20
- Robots are identified based on the number of requests per session and are removed.
http://citeseer.ist.psu.edu
34. CiteSeer user actions and descriptions used during the experiments
35. Experiments
- User behavior models are evaluated based on the accuracy of their predictions.
- User behavior clusters are visualized to demonstrate the descriptive ability of the models.
- For maxent models the history was empirically set to the last 5 actions.
36. Hit Ratio Results: Returning Users
[Figure: hit ratio results on known users for the 3-component mixture model]
- Personalized models outperformed global models.
- Personalization improved the maxent model significantly.
- The personalized Markov mixture is better for returning users.
[Figure: hit ratio results on known users for the 10-component mixture model]
37. Hit Ratio Results: All Users
[Figure: hit ratio results on all users for the 3-component mixture model]
- Maxent performs better than Markov for all users.
- Maxent chooses the flattest function and has more parameters; a mixture of Markov models can better fit the training data (for returning users).
[Figure: hit ratio results on all users for the 10-component mixture model]
38. User Behavior Clusters - Maxent Model Graphs
- Cluster 4: users querying CiteSeer through another search engine
- Cluster 6: users who use the CiteSeer query interface
- Cluster 9: users who look at recommendations and citations
39. Conclusions
- Both mixture of maximum entropy and mixture of Markov models are able to generate strong predictive models.
- Markov models are better at predicting the future actions of returning users.
- Maximum entropy models perform better for the global model and for first-time users.
- Predicting future actions could be useful for recommending shortcuts and personalizing the user interface.
- CiteSeer users differ significantly in the way they use the engine.
- Mixture models can be used to discover these different patterns.
- Identifying the cluster a session belongs to can be used to recommend certain functionalities of CiteSeer (e.g. BibTeX, recommendations, etc.).
Manavoglu, Pavlov, Giles, ICDM 2003
40. Navigation Graph
41. Query Behavior Analysis (Work in Progress)
42. Query Behavior Patterns - Motivation
- Goals
  - To investigate how users interact with the query interface
  - To identify dominant patterns of querying behavior, which could then be used in recommendations
- If the user is submitting the same query over and over, should we show the same recommendations?
- If the user is persistently modifying the original query string, should we change the recommendation criteria to give more priority to the diversity of the recommendations (because modifying the query could be an indication of unsatisfactory results/recommendations)?
43. Query Types
- New Query String: none of the terms in the query (excluding stop words and logical operators) were seen in the previous query
- Modified Query String: a partially new query; some terms are eliminated and/or added
- Same Query String: same as the most recent query string
- Document Query: the document database is queried
- Citation Query: the citation database is queried
44. Query Behavior Patterns - Methodology
- Sequential modeling
  - Sessionize the log files
  - Consider the queries in a session as an ordered sequence
  - Identify the query types within each session
  - Investigate the patterns in the ordered sequence of query types
- Algorithm
  - Mixture of 1st-order Markov models
- Experimental setup
  - Same as in the previous section
45. Query Types - Example
- Input
  - <sessionID queryEvent userID queryString>
  - 1 documentquery uid1 online routing
  - 1 documentquery uid1 online routing lipmann
  - 1 documentquery uid1 lipmann
  - 1 documentquery uid1 Maarten Lipmann
  - 1 documentquery uid1 oltsp
  - 1 citationquery uid1 oltsp
- Output
  - uid1: docNew docModified docModified docModified docNew citeOld
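A minimal sketch that reproduces the labeling in this example; stop-word and logical-operator filtering is omitted, and the label vocabulary is illustrative (the slide uses "citeOld" where this sketch would say "citeSame").

```python
def label_queries(queries, stopwords=frozenset()):
    """Label each (event, query_string) in a session as New, Modified, or Same
    relative to the immediately preceding query, prefixed with 'doc' or 'cite'
    according to which database was queried."""
    labels, prev_terms = [], None
    for event, qstring in queries:
        terms = {t for t in qstring.lower().split() if t not in stopwords}
        prefix = "doc" if event == "documentquery" else "cite"
        if prev_terms is None or not (terms & prev_terms):
            kind = "New"          # no overlap with the previous query's terms
        elif terms == prev_terms:
            kind = "Same"
        else:
            kind = "Modified"     # partial overlap: terms added and/or removed
        labels.append(prefix + kind)
        prev_terms = terms
    return labels

# Session of uid1 from the slide:
# label_queries([("documentquery", "online routing"),
#                ("documentquery", "online routing lipmann"),
#                ("documentquery", "lipmann"),
#                ("documentquery", "Maarten Lipmann"),
#                ("documentquery", "oltsp"),
#                ("citationquery", "oltsp")])
# -> ['docNew', 'docModified', 'docModified', 'docModified', 'docNew', 'citeSame']
```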
46. Query Behavior - Results
47. Query Behavior Analysis - Observations
- Users tend to submit the same query more than once in the same session
- Only a small portion of CiteSeer users submit citation queries (15%)
- CiteSeer users try to modify their queries instead of giving up
- Can query modification be viewed as an indication that the document recommendation or ranking strategy should change?
48. Online Clustering of Search Results
49. Online Clustering of Search Results - Motivation
- So far we have looked at user behavior to improve the user experience and retrieval quality.
- Is it also possible to organize the data adaptively to improve the quality of the results?
- Grouping topically related items together on the result page may improve the coverage and diversity of the displayed results
50. Why Online?
- An offline clustering/classification of search results would not be query dependent.
  - Two documents could be related for the query "machine learning" but not as relevant for a rather more specific one, "machine learning for recommender systems".
- The time of the query changes the set of available documents.
  - A news search for the query "hurricane" today and the same query a week ago would return very different sets of documents
51. Online Clustering of Search Results - Test Case
- To evaluate the performance of clustering on a frequently updated index (with deletions as well as additions of documents), we chose to work on news articles.
- Due to the architecture of the underlying news search engine (Yahoo! News), the implemented solution had to satisfy the following conditions:
  - Access to 100 documents per query
  - Work in a stateless/memoryless environment
  - Fetch the original results, cluster them, and return the clustered results in less than 300 milliseconds
This work was done during an internship at Yahoo! The paper is under review and subject to confidentiality requirements; please ask about the details.
52. Online Clustering of Search Results - Algorithm
- Clustering algorithm
  - Hierarchical Agglomerative Clustering (HAC), mixture of multinomials, Cluto, KNN, and K-means were implemented and compared. The HAC algorithm outperformed the others in the news domain.
  - The HAC algorithm works on the pairwise similarities of news articles.
  - The pairwise similarity matrix is computed at query time
  - The similarity of one cluster to another is computed as a factor of the pairwise document similarities.
  - The similarity metric is an approximation of the cosine similarity measure.
  - The stopping criterion is a threshold on the similarity of the clusters. The optimum threshold value is found by an exhaustive search on a hold-out dataset. (A sketch follows the survey reference below.)
For a survey of clustering algorithms see Berkhin, 2002.
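A minimal sketch of threshold-stopped HAC over cosine similarities, assuming documents are NumPy rows of TF-IDF-style vectors and average-link merging; the production system's similarity approximation, linkage, and tuned threshold are not reproduced here.

```python
import numpy as np

def hac_cluster(doc_vectors, sim_threshold=0.3):
    """Merge the most similar pair of clusters (average pairwise cosine
    similarity of their documents) until no pair exceeds sim_threshold."""
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    X = doc_vectors / np.maximum(norms, 1e-12)   # L2-normalize rows
    sims = X @ X.T                               # cosine similarity matrix
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = sims[np.ix_(clusters[i], clusters[j])].mean()
                if s > best:
                    best, best_pair = s, (i, j)
        if best < sim_threshold:                 # stopping criterion
            break
        i, j = best_pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```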
53. Online Clustering of Search Results - Clustering Algorithm Evaluation
- The HAC clustering algorithm was evaluated against the aforementioned algorithms and Google News clusters by an editorial review.
- 300 queries were chosen at random from that day's user logs.
- Reviewers were shown 4 documents per cluster.
- Results of 2 different algorithms were displayed side by side and reviewers were asked to choose between the two.
- HAC was shown to outperform the rest of the algorithms (the difference was statistically significant)
54. Online Clustering of Search Results - Naming the Clusters
- Users may not realize how and why the documents were grouped together. Cluster names that serve as short descriptions would help users process and navigate the result set more easily.
- Common naming approaches
  - Choose the most representative article for each cluster and use its title as the cluster label (Google News search, http://news.google.com)
  - Find the most representative phrases within the documents belonging to each cluster (Clusty, http://news.clusty.com/)
For Keyphrase Extraction see Zha, SIGIR 2002
55. Online Clustering of Search Results - Naming the Clusters
- Observations
  - News articles are usually formulated to answer more than one of the following questions: where, when, who, what, and how.
  - Articles belonging to the same cluster may share answers to only some of these questions.
- Goal
  - We want to partition the titles into substrings corresponding to one (or more) of these questions, and then choose the most common one among these substrings.
56. Online Clustering of Search Results - Naming the Clusters
- Methodology
  - Use POS tags to find the boundaries between compact substrings.
  - Break the sentence into smaller parts at these points and generate a candidate set composed of these shorter substrings and their contiguous concatenations.
- Example
  - Title: "Wilma forces Caribbean tourists to flee"
  - Substrings: "Wilma", "Caribbean tourists", "flee"
  - Candidates: "Wilma", "Wilma forces Caribbean tourists", "Caribbean tourists", "Wilma forces Caribbean tourists to flee", "flee", "Caribbean tourists to flee"
- Score the candidates based on coverage, frequency, or length, or a combination of these metrics (a scoring sketch follows).
- Choose the candidate with the highest score as the cluster label.
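A minimal sketch of candidate generation and one plausible length-normalized frequency score, assuming each title has already been split into compact chunks (unlike the example above, connecting words between chunks are dropped here); the exact normalization used in the thesis work may differ.

```python
from collections import Counter
from itertools import combinations

def candidates_from_chunks(chunks):
    """All contiguous concatenations of a title's compact substrings."""
    return [" ".join(chunks[i:j])
            for i, j in combinations(range(len(chunks) + 1), 2)]

def best_label(cluster_title_chunks):
    """Pick a cluster label by a length-normalized frequency: count in how many
    titles each candidate occurs, then reward longer candidates to offset the
    natural bias toward short, highly frequent substrings."""
    counts = Counter()
    for chunks in cluster_title_chunks:
        for cand in set(candidates_from_chunks(chunks)):
            counts[cand] += 1
    return max(counts, key=lambda c: counts[c] * len(c.split()))
```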
57. Online Clustering of Search Results - Naming the Clusters
- Scoring algorithms
  - Coverage score: a metric measuring how well the candidate covers the information in all titles.
  - Frequency score: the frequency of the full candidate string.
  - Length-normalized frequency score: candidates with fewer words naturally have higher frequencies; to avoid this bias towards shorter candidates, we normalize the frequency scores based on the number of words they include
58. Online Clustering of Search Results - Naming Evaluation
- 160 queries were chosen at random from the daily query logs. The results of these queries were clustered and saved.
- Candidate names for each cluster were generated offline. Candidates included the titles as well.
- 8 users were asked to either select the best label from the list of candidates or type in their own choice if they could not find an appropriate candidate.
- At least 3 judgments were collected per query.
- 55% of the names chosen by reviewers were titles.
- Experimental results showed that the length-normalized frequency score outperforms the others (60% match with user labels).
- In the case of ties, coverage scores were found to be the best tie-breaker.
59. Work In Progress
- Further query analysis
  - What are the users searching for: titles, abstracts, references, publication dates, venues?
  - Are users viewing/downloading linked documents, or not?
- Investigating community-based adaptation strategies
- Designing a framework for online evaluation of recommender systems
  - Such a platform for CiteSeer (called REFEREE) was designed by Cosley et al. in its early days. Our goal is to make it scale to, and be available in, the next-generation CiteSeer.
Recently published paper on discovering e-communities: Zhou, Manavoglu, Li, Giles, Zha, WWW 2006.
60. My Proposal
- MyCiteSeer
  - A unified adaptive information retrieval engine in which document, action, and query behaviors are all factored in
61. How?
- The action model selects the modules and shortcuts to be displayed
  - e.g., user U is interested only in "users who viewed" document recommendations => only this recommendation module is shown on the document page
- The query model and action model select the query interface
  - e.g., user U searches for titles only and looks at more than 20 results => the default query field is title and the results are clustered
- The document model selects the content, with input from the action and query models.
- The action and query models act as implicit feedback
62. Evaluation?
- A user study would give the most reliable and comprehensive assessment of the system. Individual components as well as the overall system can be evaluated.
- Some form of implicit feedback can also be used for evaluation.
  - Clickthrough rate on the results page, number of actions taken to access a given document/page, time spent completing certain tasks.
- If used together with some explicit user feedback (through random questionnaires, ratings, or feedback via email, blogs, or forums), the appropriate metrics could be identified and used.