Title: Discovery of Indirect Association from Web Usage Data
1Discovery of Indirect Association from Web Usage
Data
Vipin Kumar
Army High Performance Computing Research Center
Department of Computer Science and Engineering
University of Minnesota
kumar_at_cs.umn.edu
Homepage: http://www.cs.umn.edu/kumar
2Overview
[Figure labels: Internet; Web usage data]
3Overview
- Web Usage Mining
  - the automatic extraction of non-trivial patterns from Web usage data
- Main techniques
  - Clustering
    - to find groups of users who share similar browsing behavior
  - Classification
    - to categorize Web users according to their past access history
  - Association
    - to determine which sets of page views are often accessed together in the same server session
4Application of Web Usage Mining
Source: J. Srivastava, R. Cooley, M. Deshpande,
PN Tan, Web Usage Mining: Discovery and
Applications of Usage Patterns from Web Data,
SIGKDD Explorations (2000)
5Taxonomy of Web Usage Mining
Source: J. Srivastava, R. Cooley, M. Deshpande,
PN Tan, Web Usage Mining: Discovery and
Applications of Usage Patterns from Web Data,
SIGKDD Explorations (2000)
6Mining Association Patterns in Web Data
Input: Click-streams
/home → /product → /product/electronics
/home → /shipping → /shipping/status
/home → /product → /product/offer → /product/electronics/tv
/home → /product → /product/electronics → /product/electronics/tv
Output: Association Patterns
- Frequent Itemsets
- Association Rules
- Sequential Patterns
7Mining Association Patterns in Web Data
- Provide the following information
  - Which sets of pages are frequently accessed together by Web users? (frequent itemsets)
  - What page will be fetched next? (association rules)
  - What are the paths frequently traversed by Web users? (sequential patterns)
- Web association patterns are useful
  - To improve Web site design
  - To develop prefetching and Web caching policies
  - To recommend related pages
  - To collect business intelligence about the behavior of Web users
8Association Mining is Support-based
- Frequent itemset
  - combination of pages whose support is greater than a user-specified minimum threshold
- Association Rule
- Higher support implies greater statistical significance
- Support-based pruning constrains the exponential complexity and makes the association rule computation tractable
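Support counting and support-based pruning can be illustrated with a minimal level-wise (Apriori-style) sketch. The function names and the toy sessions are assumptions made for illustration (the paths are borrowed from the click-stream examples earlier in the deck); this is not the implementation used in the experiments.

```python
from itertools import combinations

def support(itemset, sessions):
    """Fraction of sessions that contain every page in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= s for s in sessions) / len(sessions)

def frequent_itemsets(sessions, min_support):
    """Level-wise search; an itemset is kept only if its support >= min_support."""
    sessions = [frozenset(s) for s in sessions]
    items = sorted({p for s in sessions for p in s})
    level = [frozenset([p]) for p in items
             if support([p], sessions) >= min_support]
    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = support(itemset, sessions)
        # Candidates of size k+1 are unions of two frequent k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Pruning: drop a candidate as soon as one of its k-subsets is infrequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in result for sub in combinations(c, k))
                 and support(c, sessions) >= min_support]
        k += 1
    return result

# Hypothetical toy sessions (each session treated as a set of page views).
sessions = [
    {"/home", "/product", "/product/electronics"},
    {"/home", "/shipping", "/shipping/status"},
    {"/home", "/product", "/product/offer", "/product/electronics/tv"},
    {"/home", "/product", "/product/electronics", "/product/electronics/tv"},
]
freq = frequent_itemsets(sessions, min_support=0.5)
```

With a threshold of 0.5, the pair {/product/electronics, /product/electronics/tv} is discarded, so no larger itemset containing it is ever counted.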
9Can Infrequent Patterns be Interesting?
- If support between A and B is too small, then there may be a negative correlation between A and B
  - P(A,B) < P(A) P(B)
- If {¬A, B} has a high support, then support(A,B) will tend to be low because
  - P(¬A,B) = P(B) − P(A,B)
- Example: Coffee and Tea
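As a small worked check of the negative-correlation test, the sketch below compares the observed joint frequency of two items with the value expected under independence. The basket counts are invented for illustration and the function name is hypothetical.

```python
def joint_vs_independent(baskets, a, b):
    """Return (observed P(a,b), P(a) * P(b)) over a list of baskets."""
    n = len(baskets)
    p_a = sum(a in s for s in baskets) / n
    p_b = sum(b in s for s in baskets) / n
    p_ab = sum(a in s and b in s for s in baskets) / n
    return p_ab, p_a * p_b

# Hypothetical data: 5 coffee-only, 3 tea-only, 1 with both, 1 with neither.
baskets = [{"coffee"}] * 5 + [{"tea"}] * 3 + [{"coffee", "tea"}] + [set()]
p_ab, expected = joint_vs_independent(baskets, "coffee", "tea")
negatively_correlated = p_ab < expected
```

Here the observed joint frequency is 0.1 while independence would predict about 0.24, so the pair would be flagged as negatively correlated.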
10Approach 1: Using Negative Items
- Computationally expensive
- Tends to produce too many negative associations
11Approach 2: Negative Itemsets
- Savasere et al. (1998)
- A negative itemset is a set of items whose actual support is significantly lower than its expected support
- Expected support can be computed using item taxonomy
Suppose C and G are frequent
12Indirect Association
[Figure: items a and b, each linked to mediator M]
THEN a and b are expected to be frequent
→ a and b are indirectly associated via mediator M
13Non-Sequential Indirect Association Formulation
- A pair of Web pages, a and b, are indirectly associated via mediator set M, if
  - Sup(a,b) < ts
  - Sup(a ∪ M) ≥ tf, Sup(b ∪ M) ≥ tf
  - Dep(a,M) ≥ td, Dep(b,M) ≥ td
  - where Dep(x,M) can be any reasonable objective measure
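The conditions of this formulation can be checked directly on a small data set. The sketch below is a brute-force illustration, not the authors' algorithm: it enumerates every mediator set up to a fixed size, and it assumes the IS (cosine) measure as the dependence measure, which is only one possible choice of "reasonable objective measure". All function names and the toy baskets are hypothetical.

```python
from itertools import combinations
from math import sqrt

def sup(items, sessions):
    """Fraction of sessions containing all of the given items."""
    items = frozenset(items)
    return sum(items <= s for s in sessions) / len(sessions)

def dep(x, mediator, sessions):
    """Assumed dependence measure: IS (cosine) between item x and the mediator set."""
    joint = sup({x, *mediator}, sessions)
    denom = sqrt(sup({x}, sessions) * sup(mediator, sessions))
    return joint / denom if denom else 0.0

def indirect_associations(sessions, ts, tf, td, max_mediator_size=2):
    """Return (a, b, M) triples satisfying the three threshold conditions."""
    sessions = [frozenset(s) for s in sessions]
    items = sorted({p for s in sessions for p in s})
    found = []
    for a, b in combinations(items, 2):
        if sup({a, b}, sessions) >= ts:   # the pair itself must be infrequent
            continue
        others = [p for p in items if p not in (a, b)]
        for size in range(1, max_mediator_size + 1):
            for m in combinations(others, size):
                if (sup({a, *m}, sessions) >= tf
                        and sup({b, *m}, sessions) >= tf
                        and dep(a, m, sessions) >= td
                        and dep(b, m, sessions) >= td):
                    found.append((a, b, frozenset(m)))
    return found

# Hypothetical baskets: coke and pepsi never co-occur, but both co-occur with ruffles.
baskets = ([{"coke", "ruffles"}] * 3 + [{"pepsi", "ruffles"}] * 3
           + [{"ruffles"}, {"milk"}])
result = indirect_associations(baskets, ts=0.1, tf=0.3, td=0.5)
```

On this toy data the only triple returned is (coke, pepsi) with mediator {ruffles}. Exhaustive enumeration is exponential in the number of items, so a practical implementation would start from the frequent itemsets instead.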
14Finding Interesting Negative Associations
For all pairs of items:
             With Mediator   No Mediator
Frequent     FM              FN
Infrequent   IM              IN
(Frequent vs. Infrequent: split by minimum itempair support; With vs. No Mediator: split by mediator thresholds)
If IM/FM ≈ IN/FN, then Indirect Association is not surprising
15Finding Interesting Negative Associations
             With Mediator   No Mediator
Frequent     FM              FN
Infrequent   IM              IN
- IM/FM is small
- IM/IN is small
- → Indirect Association is interesting
16Finding Interesting Negative Association
Indirect Association is interesting when the minimum
itempair support threshold is small. But if the
threshold is too low, very few indirect
associations are obtained.
17Application: Market-Basket Analysis
- Substitute items: up-sell
  - Pavilion PC ↔ 17" Monitor ↔ Pavilion multimedia PC
- Competing items: competitive analysis
  - Coke ↔ Ruffles ↔ Pepsi
- Complementary items: cross-sell
  - Tekken 3 ↔ Playstation Memory Card ↔ Tomb Raider 2
18Other Applications: Information Retrieval
- Identify synonyms and antonyms
- Identify the different contexts of a queried word
- Useful to group together query results
[Figure: example with the words Union, Trade, Soviet, Worker]
19Other Applications: Stock Market
Indirect Association can partition events that
are associated with the movement of a stock price.
20Sequential Indirect Association Formulation
- A pair of Web pages, a and b, forms a sequential indirect association via a mediator sequence, w, if the following conditions are satisfied
  - Support(a,b) < ts
  - Support(s1) ≥ tf, Support(s2) ≥ tf (s1 = aw or wa, s2 = bw or wb)
  - Dependence(a,w) ≥ td, Dependence(b,w) ≥ td
  - w does not contain a or b
- Can discover groups of users who have different interests but share a common traversal path
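The support conditions of the sequential formulation can be sketched as follows. Several choices here are assumptions rather than part of the slides: a sequence is counted as supported by a session when it occurs as an order-preserving subsequence (gaps allowed), Support(a,b) is taken as co-occurrence of a and b in the same session regardless of order, and the dependence condition is left out for brevity. Function names and the toy sessions are hypothetical.

```python
def contains(session, pattern):
    """True if pattern occurs in session as an order-preserving subsequence."""
    remaining = iter(session)
    return all(page in remaining for page in pattern)

def seq_support(pattern, sessions):
    """Fraction of sessions that contain the pattern."""
    return sum(contains(s, pattern) for s in sessions) / len(sessions)

def sequential_candidates(a, b, w, sessions, ts, tf):
    """Return the (s1, s2) pairs that pass the support conditions only."""
    if a in w or b in w:                      # w must not contain a or b
        return []
    both = sum(a in s and b in s for s in sessions) / len(sessions)
    if both >= ts:                            # a and b must rarely co-occur
        return []
    s1_options = [(a, *w), (*w, a)]           # aw or wa
    s2_options = [(b, *w), (*w, b)]           # bw or wb
    return [(s1, s2) for s1 in s1_options for s2 in s2_options
            if seq_support(s1, sessions) >= tf
            and seq_support(s2, sessions) >= tf]

# Hypothetical sessions: a shared path /home, /product, then different interests.
sessions = [
    ["/home", "/product", "/tv"],
    ["/home", "/product", "/tv"],
    ["/home", "/product", "/camera"],
    ["/home", "/product", "/camera"],
    ["/home", "/about"],
]
pairs = sequential_candidates("/tv", "/camera", ("/home", "/product"),
                              sessions, ts=0.1, tf=0.3)
```

On this toy data the only surviving pair is (wa, wb): both groups follow the mediator path and then branch to different pages.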
21Types of Sequential Indirect Association
- Type C (convergence): different entry points
- Type D (divergence): different user interests
- Type T (transitive): different entry and user interests
22Clustering vs Indirect Association
- Clustering is another way to discover different groups of users
  - A. Banerjee and J. Ghosh, Clickstream Clustering using Weighted Longest Common Subsequences, Workshop on Web Mining (2001)
  - TW Yan, M Jacobsen, H Garcia-Molina and U Dayal, From User Access Patterns to Dynamic Hypertext Linking, Proc of the 5th International World Wide Web Conference (1996)
  - YJ Fu, K Sandhu, MY Shih, A Generalization-Based Approach to Clustering of Web Usage Sessions, Web Usage Analysis and User Profiling, LNAI (2001)
- Clustering cannot find distinct groups of users who share a similar traversal path since the support of the mediator is large (the mediator often contains navigation pages)
- It is more likely that several indirect associations are contained within a single cluster than that each indirect association connects two separate clusters
23Indirect Association for Web Data
24Impact of Site Structure on Associations
- Navigation pages (e.g. home pages and hub pages) tend to have higher support than content pages
Source: U of Minnesota Computer Science department Web logs (Jan 1-31, 2001)
- If threshold is too high, most of the patterns contain only navigation pages
- If threshold is too low, too many patterns!
25How Indirect Association helps
- Indirect association groups together patterns that have similar substructures
- The common substructures (mediators) often contain the navigation pages
[Figure labels: Mediator; Navigation pages]
26Grouping Indirect Associations
- Indirect associations can also be grouped
together into more compact structures if they
share a common mediator
Check degree of association
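Grouping by common mediator amounts to indexing the discovered (a, b, mediator) triples by their mediator. The sketch below is one simple way to do this; the function name and the example triples are hypothetical, and it does not perform the "check degree of association" step.

```python
from collections import defaultdict

def group_by_mediator(associations):
    """Map each mediator set to the set of indirectly associated pairs it mediates."""
    groups = defaultdict(set)
    for a, b, mediator in associations:
        groups[frozenset(mediator)].add(frozenset((a, b)))
    return dict(groups)

# Hypothetical indirect associations sharing a navigation-page mediator.
associations = [
    ("/grad/apply", "/minor", {"/home", "/academics"}),
    ("/grad/apply", "/phd-exam", {"/home", "/academics"}),
    ("/contact", "/minor", {"/home"}),
]
groups = group_by_mediator(associations)
```

Each resulting group lists the pages that hang off the same mediator, which is the more compact structure described above.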
27Viewer for Non-Sequential Indirect Association
[Figure: viewer screenshot. Labels: Indirect Pairs (dashed line), Frequent Pairs (solid line), List of Mediators, Currently Selected Mediator]
28Experimental Setup
- Data sources
  - UMN: University of Minnesota Web log
    - contains 91,443 page views and 34,526 sessions
  - EC: E-commerce Web log
    - contains 6,664 page views and 143,604 sessions
- Steps
  - Preprocessing
    - identify sessions, convert sessions into transactions
  - Extract frequent itemsets or sequences
    - apply Apriori algorithm to find frequent itemsets
    - apply GSP algorithm to find frequent sequences
  - Apply indirect association algorithm
  - Merge indirect associations that have common mediator
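The preprocessing step can be sketched as below. The slides do not say how sessions were identified, so this assumes a common heuristic: requests from the same visitor are split into a new session after 30 minutes of inactivity. The log format (visitor, timestamp in seconds, page), the timeout, and the function names are all assumptions.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity; an assumed cutoff

def sessionize(log, timeout=SESSION_TIMEOUT):
    """Split each visitor's time-ordered requests into sessions (lists of pages)."""
    by_visitor = defaultdict(list)
    for visitor, timestamp, page in sorted(log, key=lambda r: (r[0], r[1])):
        by_visitor[visitor].append((timestamp, page))
    sessions = []
    for requests in by_visitor.values():
        current, last_time = [], None
        for timestamp, page in requests:
            # A long gap since the previous request closes the current session.
            if last_time is not None and timestamp - last_time > timeout:
                sessions.append(current)
                current = []
            current.append(page)
            last_time = timestamp
        sessions.append(current)
    return sessions

def to_transactions(sessions):
    """Drop order and repeats so each session becomes a set of page views."""
    return [frozenset(s) for s in sessions]

# Hypothetical log records: (visitor, timestamp in seconds, page).
log = [
    ("10.0.0.1", 0, "/home"),
    ("10.0.0.2", 30, "/home"),
    ("10.0.0.1", 60, "/product"),
    ("10.0.0.2", 90, "/shipping"),
    ("10.0.0.1", 4000, "/home"),
]
sessions = sessionize(log)
transactions = to_transactions(sessions)
```

The ordered sessions would feed a sequence miner such as GSP, while the unordered transactions would feed Apriori.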
29Experimental Results (UMN)
Students taking CS as minor subject
Contact information
Prospective graduate students
CS Graduate student association
Students planning to take PhD exam
30Experimental Results (UMN)
31Experimental Results (EC)
32Experimental Results (EC)
33Experimental Results (EC)
34Experimental Results (EC)
35Experimental Results (EC)
36Experimental Results (EC)
37Conclusion
- Indirect Association provides an alternative approach to capture interesting infrequent patterns
- For Web data, indirect association represents the distinct interests of Web users who share a similar traversal path
  - Such patterns cannot be easily found using standard association and clustering techniques
- Indirect Association can be used to group together patterns into more compact structures
  - Navigation pages form the mediators
38References
- PN Tan, V Kumar, J Srivastava, Indirect Association: Mining Higher Order Dependencies in Data, In Proc of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, Lyon, France, Sept 13-16, 632-637 (2000)
- PN Tan, V Kumar, H Kuno, Using SAS for Mining Indirect Associations in Data, In Proc of the Western Users of SAS Software Conference (2001)
- PN Tan, V Kumar, Mining Indirect Associations in Web Data, In Proc of WebKDD 2001: Mining Log Data Across All Customer TouchPoints, August (2001)
- KDD Web Mining workshops: WebKDD 1999, WebKDD 2000, WebKDD 2001
- SIAM Workshop on Web Mining 2001 and 2002