Title: Discovery of Aggregate Usage Profiles for Web Personalization
1Discovery of Aggregate Usage Profiles for Web
Personalization
- Bamshad Mobasher, Honghua Dai, Tao Luo, Miki Nakagawa, Jim Wiltshire
  School of Computer Science, Telecommunications, and Information Systems, DePaul University
2Web Personalization
- The Problem
  - dynamically serve customized content (pages, products, etc.) to users based on their profiles, preferences, or expected interests
- Current Approaches
  - rule-based filtering
    - usually relies on static profile for users in part obtained through explicit registration
  - collaborative filtering
    - usually requires explicit ratings from users on similar types of objects
  - content-based filtering
    - learn/store personal profiles locally or on server-side
    - based on content similarity of user profile to pages or product descriptions
- Limitations of Current Technologies
  - user input may be subjective and prone to bias
  - explicit (and non-binary) user ratings may not be available
  - profiles may be static and can become outdated quickly
  - collaborative filtering: problems with scalability due to sparse data
  - content-based filtering may miss other semantic relationships among objects
3Usage-Based Web Personalization
- Basic Idea
  - find aggregate user profiles by automatically discovering user access patterns through Web usage mining (offline process)
  - data sources for mining include server logs, other click-stream data (e.g., product-oriented user events), and site structure
  - match a user's active session against the discovered profiles to provide dynamic content (online process)
- Advantages / Goals
  - profiles are based on objective information (how users actually use the site)
  - no explicit user ratings or interaction with users (to enter a profile, etc.)
  - helps preserve user privacy by making effective use of anonymous data
  - usage data captures relationships missed by content-based approaches
  - can help enhance the effectiveness of collaborative or content-based filtering techniques
4Automatic Web Personalization: Offline Process
[Diagram of the offline process; labels grouped by stage]
- Inputs: Server Logs, Other Click-Stream Data, Domain Knowledge
- Data Preparation: Data Cleaning, Session Identification, Pageview Identification, Transaction Identification, Support Filtering
- Usage Mining: Transaction Clustering, Pageview Clustering, Association-Rule Discovery
- Output: Usage Profiles
5Automatic Web Personalization: Online Process
[Diagram of the online process]
- Inputs to the Recommendation Engine: Active Session, Input from the batch process
- Output: Recommendations
6Data Preparation Tasks
- Preprocess and filter logs and other usage data
  - remove redundant references and create pageviews
  - domain knowledge to assign types to pageviews
  - handle references to scripts creating dynamic pages
  - map logs against site topology
- Identify user sessions and transactions
  - heuristics based on IP, referrer, agent fields, and session time-outs used to identify unique user sessions (may need to infer missing references)
  - intra-session transactions can be obtained based on a model of user behavior (involves classifying references as content or navigational for each user)
  - weights are assigned to each pageview based on static pageview types as well as some measure of user interest (e.g., duration of pageview)
- Support filtering: remove very low/high support pageviews
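A minimal sketch of two of the tasks above, session identification by time-out and support filtering. The hit layout `(ip, agent, timestamp, url)`, the 30-minute time-out, and the support bounds are assumptions made for illustration, not values from the slides; referrer-based heuristics and inference of missing references are left out.

```python
from collections import Counter, defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; hypothetical cutoff, not from the slides


def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Group (ip, agent, timestamp, url) hits into sessions.

    A new session starts whenever the same (ip, agent) pair has been
    idle for longer than `timeout` seconds.
    """
    by_user = defaultdict(list)
    for ip, agent, ts, url in hits:
        by_user[(ip, agent)].append((ts, url))

    sessions = []
    for visits in by_user.values():
        visits.sort()  # order each user's hits by timestamp
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions


def support_filter(sessions, min_support=0.05, max_support=0.95):
    """Drop pageviews whose support is outside [min_support, max_support].

    Support is the fraction of sessions that contain the pageview.
    """
    n = len(sessions)
    counts = Counter(url for s in sessions for url in set(s))
    keep = {u for u, c in counts.items() if min_support <= c / n <= max_support}
    return [[u for u in s if u in keep] for s in sessions]
```

Sessions that lose all their pageviews are returned as empty lists, so a caller would probably drop them before mining.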
7Aggregate Usage Profiles
- Characteristics of Aggregate Profiles
  - the goal is to effectively capture common usage patterns from potentially anonymous click-stream data
  - profiles are represented as weighted collections of pageviews
  - weights represent the significance of pageviews within each profile
  - profiles are overlapping in order to capture common interests among different groups/types of users
  - multiple profiles may contribute to the recommendation set for a given user
- Example Profiles from the ACR (Assoc. for Consumer Research) Site
Profile 1
  Weight  Pageview
  1.00    Call for Papers
  0.67    ACR News Special Topics
  0.67    CFP Journal of Psychology and Marketing I
  0.67    CFP Journal of Psychology and Marketing II
  0.67    CFP Journal of Consumer Psychology II
  0.67    CFP Journal of Consumer Psychology I

Profile 2
  Weight  Pageview
  1.00    CFP Winter 2000 SCP Conference
  1.00    Call for Papers
  0.36    CFP ACR 1999 Asia-Pacific Conference
  0.30    ACR 1999 Annual Conference
  0.25    ACR News Updates
  0.24    Conference Update
8Methodologies for the Discovery of Aggregate
Profiles
- Discovery of Profiles Based on Transaction Clusters
  - cluster user transactions; features are significant pageviews identified in the preprocessing stage
  - derive usage profiles (set of pageview-weight pairs) based on characteristics of each transaction cluster
- Cluster Pageviews
  - directly compute overlapping clusters of pageviews based on co-occurrence patterns across transactions
  - features are user transactions, so dimensionality poses a problem for traditional clustering algorithms
  - we use Association-Rule Hypergraph Partitioning with an overlap factor
9Profile Aggregation Based on Clustering
Transactions (PACT)
- Input
  - set of relevant pageviews in preprocessed log
  - set of user transactions
  - each transaction is a pageview vector
- Transaction Clusters
  - each cluster contains a set of transaction vectors
  - for each cluster compute centroid as cluster representative
- Aggregate Usage Profiles
  - a profile is a set of pageview-weight pairs: for transaction cluster C, select each pageview pi whose weight (in the cluster centroid) is greater than a pre-specified threshold
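The profile derivation step can be sketched as follows. This assumes the clustering itself has already been done and that its result is passed in as one cluster label per transaction; the function name `pact_profiles` and the default threshold are hypothetical, and the centroid weight of a pageview is taken to be its mean weight over the transactions in the cluster.

```python
def pact_profiles(transactions, labels, pageviews, threshold=0.5):
    """Derive one profile (pageview -> weight) per transaction cluster.

    transactions: list of equal-length weight vectors, one per transaction
    labels:       cluster id for each transaction (from any clustering step)
    pageviews:    names for the vector positions
    threshold:    hypothetical cutoff; only weights above it are kept
    """
    clusters = {}
    for vec, label in zip(transactions, labels):
        clusters.setdefault(label, []).append(vec)

    profiles = {}
    for label, members in clusters.items():
        # Centroid: mean weight of each pageview over the cluster members.
        centroid = [sum(col) / len(members) for col in zip(*members)]
        profiles[label] = {
            p: w for p, w in zip(pageviews, centroid) if w > threshold
        }
    return profiles
```

Because each cluster is thresholded independently, the same pageview can appear in several profiles, which is how the profiles come to overlap.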
10Hypergraph-Based Clustering
- Construct a hypergraph from sets of related items
  - Each hyperedge represents a frequent itemset
  - Weight of each hyperedge can be based on the characteristics of frequent itemsets or association rules
- Recursively partition hypergraph so that each partition contains only highly connected data items
  - Given a hypergraph G(V,E) we find a k-way partitioning such that the weight of the hyperedges that are cut is minimized
  - The fitness of partitions is measured in terms of the ratio of weights of cut edges to the weights of uncut edges within the partitions
  - The connectivity measures the percentage of edges within the partition with which the vertex is associated (used for filtering partitions)
  - Vertices from partial edges can be added back to clusters based on a user-specified overlap factor
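As an illustration of the connectivity measure described above, the sketch below computes, for a vertex, the fraction of hyperedges lying entirely inside a partition that contain that vertex, and then filters a partition by a minimum connectivity. The function names and the `min_connectivity` parameter are hypothetical; the partitioning itself is assumed to have been produced elsewhere.

```python
def connectivity(vertex, partition, hyperedges):
    """Fraction of hyperedges fully inside `partition` that contain `vertex`.

    partition:  set of vertices
    hyperedges: iterable of sets (or frozensets) of vertices
    """
    inside = [e for e in hyperedges if e <= partition]
    if not inside:
        return 0.0
    return sum(1 for e in inside if vertex in e) / len(inside)


def filter_partition(partition, hyperedges, min_connectivity):
    """Keep only the vertices whose connectivity reaches `min_connectivity`."""
    return {
        v for v in partition
        if connectivity(v, partition, hyperedges) >= min_connectivity
    }
```

For example, with hyperedges {a, b} and {a, c} inside the partition {a, b, c}, vertex a has connectivity 1.0 and vertices b and c each have 0.5.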
11Profiles Based on Hypergraph Clusters of Pageviews
- Input
  - input for clustering is the set of large itemsets from association rule module
  - each itemset is a hyperedge (weights are a function of the interest of the itemset)
- Aggregate Profiles (Pageview Clusters)
  - hMETIS used as the underlying hypergraph partitioning algorithm
  - clustering program directly outputs a set of overlapping pageview clusters
  - the weight associated with pageview p in a cluster C is based on the connectivity value of p in hypergraph partition
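A rough sketch of how the hyperedge input might be built. The slides do not give the interest formula, so this assumes interest is the support of the itemset divided by the product of the supports of its individual items. The brute-force enumeration (no Apriori-style pruning) and the names `interest_hyperedges`, `min_support`, and `max_size` are also illustrative assumptions; the result would then be handed to a hypergraph partitioner such as hMETIS, which is not invoked here.

```python
from itertools import combinations
from math import prod


def interest_hyperedges(transactions, min_support=0.3, max_size=3):
    """Map each frequent itemset (as a frozenset) to an interest weight.

    Assumed weight: support(itemset) / product of single-item supports.
    """
    n = len(transactions)
    tsets = [set(t) for t in transactions]
    items = sorted(set().union(*tsets))

    def support(itemset):
        return sum(1 for t in tsets if itemset <= t) / n

    single = {i: support({i}) for i in items}
    edges = {}
    # Brute-force enumeration; fine for a small example, not for real logs.
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            s = support(set(combo))
            if s >= min_support:
                edges[frozenset(combo)] = s / prod(single[i] for i in combo)
    return edges
```

Under this assumed definition, a weight above 1 means the pageviews co-occur more often than they would if visits were independent.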
12Recommendations Based on Usage Profiles
- Match current user's activity against the discovered usage profiles
  - a sliding window over the active session to capture the current user's short-term history depth
  - usage profiles and the active session are treated as vectors
  - matching score is computed based on the similarity between vectors (e.g., normalized cosine similarity)
- Recommendations
  - each pageview is assigned a recommendation score based on
    - matching score to aggregate profiles
    - information value of the pageview based on domain knowledge (e.g., link distance of the candidate recommendation to the active session)
  - recommendations are contributed by multiple matching aggregate profiles
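The matching and scoring steps above could look roughly like this. The sketch uses binary weights for pages in the sliding window and cosine similarity as the matching score. How the profile weight and the matching score are combined (here, the square root of their product, with the best score kept across profiles) is an assumption, and the domain-knowledge factor such as link distance is left out.

```python
from math import sqrt


def match_score(session, profile):
    """Cosine similarity between two pageview -> weight dicts."""
    dot = sum(w * session.get(p, 0.0) for p, w in profile.items())
    norm = sqrt(sum(w * w for w in session.values())) * sqrt(
        sum(w * w for w in profile.values())
    )
    return dot / norm if norm else 0.0


def recommend(history, profiles, window=3, threshold=0.5):
    """Score pages not yet in the active window; keep those above threshold.

    history:  list of pageviews visited so far, oldest first
    profiles: list of pageview -> weight dicts
    """
    active = {p: 1.0 for p in history[-window:]}  # sliding window
    scores = {}
    for profile in profiles:
        m = match_score(active, profile)
        for page, weight in profile.items():
            if page in active:
                continue  # do not recommend what the user is already viewing
            score = sqrt(weight * m)  # assumed combination rule
            scores[page] = max(scores.get(page, 0.0), score)
    return {p: s for p, s in scores.items() if s >= threshold}
```

Keeping the maximum score per page is one simple way to let several matching profiles contribute to the same recommendation set.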
13Experimental Set-up
- The Data Sets
  - Log data from the Association for Consumer Research Web site
  - 18342 transactions, 62 pageview URLs (after filtering)
  - Data set divided into training and evaluation sets
- Evaluation Methodology
  - Portion of each transaction (based on a specified window size) in evaluation set was used to generate a recommendation set (based on a given recommendation threshold)
  - For each transaction, the overall coverage of the recommendation set was divided by the number of recommendations to produce an accuracy measure
  - The overall score was computed (for each threshold) by taking the average scores over all transactions in the evaluation set
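One possible reading of this methodology, sketched under assumptions: "coverage" is taken to mean the number of recommended pages that occur in the part of the transaction not used as the active window, and `recommend_fn` is a hypothetical callable that maps the visible pages to a recommendation set. The slides do not pin down either detail.

```python
def accuracy(transaction, window, recommend_fn):
    """Fraction of recommended pages found in the unseen part of a transaction."""
    seen, unseen = transaction[:window], set(transaction[window:])
    recs = set(recommend_fn(seen))
    if not recs:
        return 0.0
    return len(recs & unseen) / len(recs)


def overall_score(transactions, window, recommend_fn):
    """Average the per-transaction accuracy over the evaluation set."""
    scores = [accuracy(t, window, recommend_fn) for t in transactions]
    return sum(scores) / len(scores) if scores else 0.0
```

Running `overall_score` once per recommendation threshold would produce the kind of accuracy-versus-threshold curve the evaluation slides refer to.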
14Average Visit Percentage
AVP measures the likelihood that a user who
visits any page in a given profile also visits
other pages in that profile
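The slide gives only a verbal definition, so the following is one way it might be formalized, stated as an assumption: take the transactions that contain at least one page of the profile, average the total profile weight each of them covers, and divide by the total weight of the profile.

```python
def average_visit_percentage(profile, transactions):
    """Assumed AVP: mean covered profile weight / total profile weight.

    profile:      dict pageview -> weight
    transactions: iterable of collections of pageviews
    """
    total_weight = sum(profile.values())
    # Only transactions that touch the profile at all are considered.
    matching = [set(t) for t in transactions if any(p in profile for p in t)]
    if not matching or total_weight == 0:
        return 0.0
    covered = sum(sum(profile[p] for p in t if p in profile) for t in matching)
    return (covered / len(matching)) / total_weight
```

With this reading, a value of 1.0 means every transaction that touches the profile visits all of its pages.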
15Evaluation: Measuring Recommendation Accuracy
Recommendation accuracy results, using an active
session window of size 3.
16Evaluation: Impact of Filtering
Comparison of PACT and Hypergraph (using window
size 2) for filtered and unfiltered data sets.
Filtering involved the removal of top-level
navigational pages from the data set, leaving
only deeper content-oriented pages.
17Conclusions
- Usage-Based Web Personalization
  - results suggest that effective personalization can be achieved even with anonymous and short-term click-stream data
  - possibly useful in the early stages of personalization when more detailed profiles are not available for individual users
  - could be used effectively in conjunction with other methods based on content-based or collaborative filtering
- Which Method is Best?
  - PACT may be most appropriate when the goal is to provide a more general personalization solution involving a variety of objects across the whole site
  - Hypergraph may be most appropriate when the goal is to provide a highly focused set of recommendations for specific portions of the site
  - In practice, usage-based methods need to be combined with other techniques to provide an integrated solution