Discovery of Indirect Association from Web Usage Data

Vipin Kumar, PAKDD (May 2002), University of Minnesota

1
Discovery of Indirect Association from Web Usage
Data
Vipin Kumar
Army High Performance Computing Research Center
Department of Computer Science and Engineering
University of Minnesota
kumar@cs.umn.edu
Homepage: http://www.cs.umn.edu/kumar
2
Overview
Internet
Web usage data
3
Overview

Web Usage Mining
the automatic extraction of non-trivial patterns
from Web usage data
Main techniques
Clustering
to find groups of users who share similar
browsing behavior
Classification
to categorize Web users according to their past
access history
Association
to determine which sets of page views are often
accessed together in the same server session

4
Application of Web Usage Mining
Source: J. Srivastava, R. Cooley, M. Deshpande,
P.N. Tan, Web Usage Mining: Discovery and
Applications of Usage Patterns from Web Data,
SIGKDD Explorations (2000)
5
Taxonomy of Web Usage Mining
Source: J. Srivastava, R. Cooley, M. Deshpande,
P.N. Tan, Web Usage Mining: Discovery and
Applications of Usage Patterns from Web Data,
SIGKDD Explorations (2000)
6
Mining Association Patterns in Web Data
Input: Click-streams
/home → /product → /product/electronics
/home → /shipping → /shipping/status
/home → /product → /product/offer → /product/electronics/tv
/home → /product → /product/electronics → /product/electronics/tv
Output: Association Patterns
Frequent Itemsets
Association Rules
Sequential Patterns
7
Mining Association Patterns in Web Data

Association patterns provide the following information
Which sets of pages are frequently accessed
together by Web users? (frequent itemsets)
What page will be fetched next? (association
rules)
What are the paths frequently traversed by Web
users? (sequential patterns)
Web association patterns are useful
To improve Web site design
To develop prefetching and Web caching policies.
To recommend related pages
To collect business intelligence about behavior
of Web users

8
Association Mining is Support-based

Frequent itemset
a combination of pages whose support is greater
than a user-specified minimum threshold
Association Rule

Higher support implies greater statistical
significance
Support-based pruning constrains the exponential
complexity and makes the association rule
computation tractable
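
As a rough illustration of support-based pruning, the sketch below runs a level-wise (Apriori-style) search over a handful of made-up sessions. The function name `frequent_itemsets`, the toy sessions, and the use of "at least" rather than "greater than" for the threshold test are assumptions for this example, not details taken from the slides.

```python
from itertools import combinations


def frequent_itemsets(sessions, min_support):
    """Level-wise (Apriori-style) search. Returns {itemset: support}."""
    sessions = [frozenset(s) for s in sessions]
    n = len(sessions)

    def support(itemset):
        # Fraction of sessions that contain every page in the itemset.
        return sum(itemset <= s for s in sessions) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: v for c, v in scored.items() if v >= min_support}

    pages = {p for s in sessions for p in s}
    current = keep_frequent({frozenset([p]) for p in pages})
    result = dict(current)
    k = 2
    while current:
        # Join frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Support-based pruning: drop a candidate if any (k-1)-subset is infrequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        result.update(current)
        k += 1
    return result


# Hypothetical sessions, loosely modelled on the click-stream paths of slide 6.
sessions = [
    {"/home", "/product", "/product/electronics"},
    {"/home", "/shipping"},
    {"/home", "/product"},
    {"/home", "/product", "/product/electronics"},
]
frequent = frequent_itemsets(sessions, min_support=0.5)
```

Because /shipping is infrequent on its own, no itemset containing it is ever counted; that is the pruning that keeps the search tractable.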

9
Can Infrequent Patterns be Interesting?

If the support of (A, B) is too small, then
there may be a negative correlation between A and
B
P(A, B) < P(A) P(B)
If (¬A, B) has a high support, then support(A, B)
will tend to be low because
P(¬A, B) = P(B) − P(A, B)
Example: Coffee and Tea
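
The negative-correlation test P(A, B) < P(A) P(B) can be checked directly from counts. The sketch below is only an illustration: the function name and the coffee/tea counts are invented, and the comparison is done on integers (both sides multiplied by the squared number of transactions) to avoid floating-point rounding.

```python
def is_negatively_correlated(n_a, n_b, n_ab, n_total):
    """True if P(A, B) < P(A) * P(B), computed from raw counts.

    n_ab / n_total < (n_a / n_total) * (n_b / n_total)
    is equivalent to n_ab * n_total < n_a * n_b.
    """
    return n_ab * n_total < n_a * n_b


# Made-up counts: 1000 baskets, 600 with coffee, 300 with tea.
few_joint = is_negatively_correlated(600, 300, 100, 1000)   # 0.10 vs 0.18
independent = is_negatively_correlated(600, 300, 180, 1000)  # 0.18 vs 0.18
```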

10
Approach 1: Using Negative Items

Computationally expensive
Tends to produce too many negative associations

11
Approach 2: Negative Itemsets

(Savasere et al., 1998)
A negative itemset is a set of items whose actual
support is significantly lower than its expected
support
Expected support can be computed using item
taxonomy

Suppose C and G are frequent
12
Indirect Association
If a and b are each strongly associated with M,
THEN a and b are expected to be frequent together
When they are not, a and b are indirectly
associated via mediator M
13
Non-Sequential Indirect Association Formulation

A pair of Web pages, a and b, is indirectly
associated via mediator set M, if
Sup(a, b) < ts
Sup(a ∪ M) ≥ tf, Sup(b ∪ M) ≥ tf
Dep(a, M) ≥ td, Dep(b, M) ≥ td
where Dep(x, M) can be any reasonable objective
measure
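
A minimal sketch of the three conditions, assuming sessions are sets of items. The slide leaves Dep open ("any reasonable objective measure"); the IS (cosine) measure used here is just one possible choice, and the baskets and thresholds are made up for illustration.

```python
from math import sqrt


def support(itemset, sessions):
    """Fraction of sessions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= s for s in sessions) / len(sessions)


def is_measure(x, mediator, sessions):
    """IS (cosine) measure, P(x, M) / sqrt(P(x) P(M)); an assumed choice of Dep."""
    joint = support({x} | set(mediator), sessions)
    denom = sqrt(support({x}, sessions) * support(mediator, sessions))
    return joint / denom if denom else 0.0


def indirectly_associated(a, b, mediator, sessions, ts, tf, td):
    m = set(mediator)
    return (
        support({a, b}, sessions) < ts            # itempair support condition
        and support({a} | m, sessions) >= tf      # mediator support conditions
        and support({b} | m, sessions) >= tf
        and is_measure(a, m, sessions) >= td      # mediator dependence conditions
        and is_measure(b, m, sessions) >= td
    )


# Made-up baskets: coke and pepsi never co-occur but both occur with ruffles.
baskets = (
    [{"coke", "ruffles"}] * 3
    + [{"pepsi", "ruffles"}] * 3
    + [{"ruffles"}] * 2
    + [{"bread"}] * 2
)
via_ruffles = indirectly_associated("coke", "pepsi", {"ruffles"}, baskets,
                                    ts=0.1, tf=0.2, td=0.5)
via_bread = indirectly_associated("coke", "pepsi", {"bread"}, baskets,
                                  ts=0.1, tf=0.2, td=0.5)
```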

14
Finding Interesting Negative Associations
For all pairs of items, split by the mediator
thresholds and the minimum itempair support

              With Mediator   No Mediator
Frequent      FM              FN
Infrequent    IM              IN

If IM/FM ≈ IN/FN
then Indirect Association is not surprising
15
Finding Interesting Negative Associations

              With Mediator   No Mediator
Frequent      FM              FN
Infrequent    IM              IN

IM/FM is small
IM/IN is small
⇒ Indirect Association is interesting

16
Finding Interesting Negative Association
Indirect Association is interesting when the
minimum itempair support threshold is small. But
if the threshold is too low, very few indirect
associations are obtained.
17
Application: Market-Basket Analysis

Substitute items: up-sell
Pavilion PC ↔ 17" Monitor ↔ Pavilion multimedia
PC
Competing items: competitive analysis
Coke ↔ Ruffles ↔ Pepsi
Complementary items: cross-sell
Tekken 3 ↔ Playstation Memory Card ↔ Tomb Raider 2

18
Other Applications: Information Retrieval

Identify synonyms and antonyms
Identify the different contexts of a queried word
Useful to group together query results

Example words (figure): Union, Trade, Soviet, Worker
19
Other Applications: Stock Market
Indirect Association can partition events that
are associated with the movement of a stock price.
20
Sequential Indirect Association Formulation

A pair of Web pages, a and b, forms a sequential
indirect association via a mediator sequence, w,
if the following conditions are satisfied
Support(a, b) < ts
Support(s1) ≥ tf, Support(s2) ≥ tf (s1 = aw or
wa, s2 = bw or wb)
Dependence(a, w) ≥ td,
Dependence(b, w) ≥ td
w does not contain a or b
Purpose: discover groups of users who have different
interests but share a common traversal path
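
The sequential conditions can be sketched in the same way. Several details below are assumptions rather than the presenter's definitions: a sequence is counted as supported when it appears as an order-preserving (not necessarily contiguous) subsequence of a session, Support(a, b) counts sessions containing both pages in any order, and Dependence is a simple confidence-style ratio. The sessions and thresholds are invented.

```python
def contains(session, pattern):
    """True if pattern occurs in session as an order-preserving subsequence."""
    it = iter(session)
    return all(page in it for page in pattern)


def seq_support(pattern, sessions):
    return sum(contains(s, pattern) for s in sessions) / len(sessions)


def seq_confidence(x, w, sessions):
    """Assumed Dependence: best support of xw or wx, relative to support of w."""
    best = max(seq_support((x,) + w, sessions), seq_support(w + (x,), sessions))
    base = seq_support(w, sessions)
    return best / base if base else 0.0


def sequential_indirect(a, b, w, sessions, ts, tf, td, dependence=seq_confidence):
    w = tuple(w)
    if a in w or b in w:                      # w must not contain a or b
        return False
    pair = sum(a in s and b in s for s in sessions) / len(sessions)
    if pair >= ts:                            # Support(a, b) < ts
        return False
    s1 = max(seq_support((a,) + w, sessions), seq_support(w + (a,), sessions))
    s2 = max(seq_support((b,) + w, sessions), seq_support(w + (b,), sessions))
    return (s1 >= tf and s2 >= tf
            and dependence(a, w, sessions) >= td
            and dependence(b, w, sessions) >= td)


# Made-up sessions: a shared path followed by two different destinations.
paths = (
    [["/home", "/academics", "/grad"]] * 3
    + [["/home", "/academics", "/undergrad"]] * 3
    + [["/home", "/contact"]] * 2
)
found = sequential_indirect("/grad", "/undergrad", ("/home", "/academics"),
                            paths, ts=0.1, tf=0.3, td=0.4)
```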

21
Types of Sequential Indirect Association
Type C (convergence): different entry points
Type D (divergence): different user interests
Type T (transitive): different entry and user interests
22
Clustering vs Indirect Association

Clustering is another way to discover different
groups of users
A. Banerjee and J. Ghosh, Clickstream Clustering
using Weighted Longest Common Subsequences,
Workshop on Web Mining (2001)
T.W. Yan, M. Jacobsen, H. Garcia-Molina and U. Dayal,
From User Access Patterns to Dynamic Hypertext
Linking, Proc. of the 5th International World Wide
Web Conference (1996)
Y.J. Fu, K. Sandhu, M.Y. Shih, A Generalization-Based
Approach to Clustering of Web Usage Sessions, Web
Usage Analysis and User Profiling, LNAI (2001)
Clustering cannot find distinct groups of users
who share a similar traversal path, since the
support of the mediator is large (the mediator
often contains navigation pages)
It is more likely that several indirect
associations are contained within a single
cluster than that each indirect association
connects two separate clusters

23
Indirect Association for Web Data
24
Impact of Site Structure on Associations

Navigation pages (e.g. home pages and hub pages)
tend to have higher support than content pages

Source: University of Minnesota Computer Science
department Web logs (Jan 1-31, 2001)

If the support threshold is too high, most of the
patterns contain only navigation pages
If the support threshold is too low, there are too
many patterns

25
How Indirect Association helps

Indirect association groups together patterns
that have similar substructures
The common substructures (mediators) often
contain the navigation pages

Mediator
Navigation pages
26
Grouping Indirect Associations

Indirect associations can also be grouped
together into more compact structures if they
share a common mediator

Check degree of association
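
One simple way to picture the grouping step is to bucket indirect associations by their mediator set. This is a sketch only: the triple layout `(a, b, mediator)`, the function name, and the page names are hypothetical, and the slide's "check degree of association" step is not modelled.

```python
from collections import defaultdict


def group_by_mediator(associations):
    """Map each mediator set to the set of indirect pairs that share it."""
    groups = defaultdict(set)
    for a, b, mediator in associations:
        groups[frozenset(mediator)].add(frozenset((a, b)))
    return dict(groups)


# Hypothetical indirect associations (a, b, mediator).
found = [
    ("/grad", "/undergrad", {"/home", "/academics"}),
    ("/grad", "/minor", {"/home", "/academics"}),
    ("/contact", "/jobs", {"/home"}),
]
grouped = group_by_mediator(found)
```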
27
Viewer for Non-Sequential Indirect Association
Indirect Pairs (dashed line)
List of Mediators
Frequent Pairs (solid line)
Currently Selected Mediator
28
Experimental Setup

Data sources
UMN: University of Minnesota Web log
contains 91,443 page views and 34,526 sessions
EC: E-commerce Web log
contains 6664 page views and 143,604 sessions
Steps
Preprocessing
identify sessions, convert sessions into
transactions
Extract frequent itemsets or sequences
apply Apriori algorithm to find frequent itemsets
apply GSP algorithm to find frequent sequences
Apply indirect association algorithm
Merge indirect associations that have common
mediator
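
The preprocessing step (identify sessions, convert sessions into transactions) might look roughly like the sketch below. The record layout `(user, timestamp_seconds, page)` and the 30-minute inactivity timeout are assumptions for illustration; the slides do not say how sessions were identified.

```python
def sessionize(records, timeout=1800):
    """Split (user, timestamp_seconds, page) records into per-user sessions.

    A new session starts when a user is seen for the first time or has been
    idle for longer than `timeout` seconds (an assumed heuristic).
    """
    sessions = []
    last_seen = {}  # user -> (timestamp of last request, index of open session)
    for user, ts, page in sorted(records, key=lambda r: (r[0], r[1])):
        prev = last_seen.get(user)
        if prev is None or ts - prev[0] > timeout:
            sessions.append([])
            idx = len(sessions) - 1
        else:
            idx = prev[1]
        sessions[idx].append(page)
        last_seen[user] = (ts, idx)
    return sessions


def to_transactions(sessions):
    """Drop ordering and repeats so each session becomes a set of pages."""
    return [frozenset(s) for s in sessions]


# Hypothetical log records.
log = [
    ("u1", 0, "/home"),
    ("u1", 60, "/product"),
    ("u1", 4000, "/home"),
    ("u2", 10, "/home"),
]
ordered_sessions = sessionize(log)
transactions = to_transactions(ordered_sessions)
```

The ordered sessions would feed a sequence miner such as GSP, while the unordered transactions would feed Apriori.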

29
Experimental Results (UMN)
Students taking CS as minor subject
Contact information
Prospective graduate students
CS Graduate student association
Students planning to take PhD exam
30
Experimental Results (UMN)
31-36
Experimental Results (EC)
37
Conclusion

Indirect Association provides an alternative
approach to capture interesting infrequent
patterns
For Web data, indirect association represents the
distinct interests of Web users who share a
similar traversal path
Such patterns cannot be easily found using
standard association and clustering techniques
Indirect Association can be used to group
together patterns into more compact structures
Navigation pages form the mediators

38
References

P.N. Tan, V. Kumar, J. Srivastava, Indirect
Association: Mining Higher Order Dependencies in
Data, In Proc. of the 4th European Conference on
Principles and Practice of Knowledge Discovery in
Databases, Lyon, France, Sept 13-16, 632-637
(2000)
P.N. Tan, V. Kumar, H. Kuno, Using SAS for Mining
Indirect Associations in Data, In Proc. of the
Western Users of SAS Software Conference (2001)
P.N. Tan, V. Kumar, Mining Indirect Associations in
Web Data, In Proc. of WebKDD 2001: Mining Log Data
Across All Customer TouchPoints, August (2001)
KDD Web Mining workshops: WebKDD 1999, WebKDD
2000, WebKDD 2001
SIAM Workshop on Web Mining 2001 and 2002