Discovery of Indirect Association from Web Usage Data

Vipin Kumar, University of Minnesota, PAKDD (May 2002)

1
Discovery of Indirect Association from Web Usage Data
Vipin Kumar
Army High Performance Computing Research Center
Department of Computer Science & Engineering
University of Minnesota
kumar_at_cs.umn.edu
Homepage: http://www.cs.umn.edu/kumar
2
Overview
[Figure labels: Internet; Web usage data]
3
Overview
  • Web Usage Mining
    • the automatic extraction of non-trivial patterns
      from Web usage data
  • Main techniques
    • Clustering: to find groups of users who share
      similar browsing behavior
    • Classification: to categorize Web users
      according to their past access history
    • Association: to determine which sets of page
      views are often accessed together in the same
      server session

4
Application of Web Usage Mining
Source: J. Srivastava, R. Cooley, M. Deshpande,
PN Tan, Web Usage Mining: Discovery and
Applications of Usage Patterns from Web Data,
SIGKDD Explorations (2000)
5
Taxonomy of Web Usage Mining
Source: J. Srivastava, R. Cooley, M. Deshpande,
PN Tan, Web Usage Mining: Discovery and
Applications of Usage Patterns from Web Data,
SIGKDD Explorations (2000)
6
Mining Association Patterns in Web Data
Input: Click-streams
  • /home → /product → /product/electronics
  • /home → /shipping → /shipping/status
  • /home → /product → /product/offer →
    /product/electronics/tv
  • /home → /product → /product/electronics →
    /product/electronics/tv
Output: Association Patterns
  • Frequent Itemsets
  • Association Rules
  • Sequential Patterns
7
Mining Association Patterns in Web Data
  • Association patterns provide the following
    information
    • Which sets of pages are frequently accessed
      together by Web users? (frequent itemsets)
    • What page will be fetched next? (association
      rules)
    • Which paths are frequently traversed by Web
      users? (sequential patterns)
  • Web association patterns are useful
    • To improve Web site design
    • To develop prefetching and Web caching policies
    • To recommend related pages
    • To collect business intelligence about the
      behavior of Web users

8
Association Mining is Support-based
  • Frequent itemset
    • a combination of pages whose support is greater
      than a user-specified minimum threshold
  • Association Rule
    • Higher support implies greater statistical
      significance
  • Support-based pruning constrains the exponential
    complexity and makes the association rule
    computation tractable
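
As an illustration of support-based pruning, here is a minimal Python sketch of a level-wise (Apriori-style) search. It is not the code used in the talk: the function names, the `max_size` cap, the use of `>=` for the threshold test, and the four toy sessions (borrowed from the click-stream examples on slide 6) are all assumptions made for the example.

```python
from itertools import combinations

def support(itemset, sessions):
    """Fraction of sessions that contain every page in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= s for s in sessions) / len(sessions)

def frequent_itemsets(sessions, min_support, max_size=3):
    """Level-wise search; infrequent itemsets are never extended."""
    sessions = [frozenset(s) for s in sessions]
    pages = sorted({p for s in sessions for p in s})
    frequent = {}
    level = [frozenset([p]) for p in pages]
    size = 1
    while level and size <= max_size:
        kept = {c: support(c, sessions) for c in level}
        # Assumption: ">=" is used here; the slide says "greater than".
        kept = {c: s for c, s in kept.items() if s >= min_support}
        frequent.update(kept)
        # Candidates of the next size are unions of surviving itemsets ...
        candidates = {a | b for a in kept for b in kept
                      if len(a | b) == size + 1}
        # ... and are pruned unless every size-subset is itself frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in kept
                        for sub in combinations(c, size))]
        size += 1
    return frequent

# Toy sessions (hypothetical data).
sessions = [
    {"/home", "/product", "/product/electronics"},
    {"/home", "/shipping", "/shipping/status"},
    {"/home", "/product", "/product/offer", "/product/electronics/tv"},
    {"/home", "/product", "/product/electronics", "/product/electronics/tv"},
]
result = frequent_itemsets(sessions, min_support=0.5)
```

With this threshold the pair {/product/electronics, /product/electronics/tv} is infrequent, so no larger itemset containing it is ever counted. That is the pruning effect the slide refers to.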

9
Can Infrequent Patterns be Interesting?
  • If the support between A and B is too small, then
    there may be a negative correlation between A and
    B
    • P(A,B) < P(A) P(B)
  • If (¬A → B) has a high support, then support(A,B)
    will tend to be low because
    • P(¬A,B) = P(B) − P(A,B)
  • Example: ¬Coffee → Tea
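
The comparison between observed and expected joint support can be checked numerically. The following Python sketch uses ten made-up transactions (hypothetical data, not from the talk) to compare P(A,B) with P(A) P(B); the helper name `joint_and_expected` is also an assumption.

```python
def joint_and_expected(a, b, transactions):
    """Return (P(a,b), P(a) * P(b)) estimated from the transactions."""
    n = len(transactions)
    p_a = sum(a in t for t in transactions) / n
    p_b = sum(b in t for t in transactions) / n
    p_ab = sum(a in t and b in t for t in transactions) / n
    return p_ab, p_a * p_b

# Hypothetical baskets: coffee and tea rarely appear together.
transactions = (
    [{"coffee"}] * 4
    + [{"tea"}] * 4
    + [{"coffee", "tea"}]
    + [set()]
)

observed, expected = joint_and_expected("coffee", "tea", transactions)
negatively_correlated = observed < expected

# Share of baskets with tea but no coffee, for comparison with P(B) - P(A,B).
p_tea = sum("tea" in t for t in transactions) / len(transactions)
p_tea_without_coffee = sum(
    "tea" in t and "coffee" not in t for t in transactions
) / len(transactions)
```

Here the observed joint support falls well below the product of the individual supports, which is the signature of a negative correlation. The last two values illustrate that the support of "tea without coffee" equals P(tea) minus the joint support.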

10
Approach 1: Using Negative Items
  • Computationally expensive
  • Tends to produce too many negative associations

11
Approach 2: Negative Itemsets
  • Savasere et al. (1998)
  • A negative itemset is a set of items whose actual
    support is significantly lower than its expected
    support
  • Expected support can be computed using item
    taxonomy

Suppose C and G are frequent
12
Indirect Association
[Figure: items a and b linked through mediator M]
THEN a and b are expected to be frequent
⇒ a and b are indirectly associated via mediator M
13
Non-Sequential Indirect Association: Formulation
  • A pair of Web pages, a and b, are indirectly
    associated via mediator set M, if
    • Sup(a,b) < t_s
    • Sup(a ∪ M) ≥ t_f, Sup(b ∪ M) ≥ t_f
    • Dep(a,M) ≥ t_d, Dep(b,M) ≥ t_d
  • where Dep(x,M) can be any reasonable objective
    measure
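
The three conditions can be sketched in Python as follows. The slide leaves the dependence measure open; this sketch plugs in the IS measure, P(x,M) / sqrt(P(x) P(M)), purely as one possible choice. The function names, thresholds and toy sessions are assumptions made for illustration.

```python
from math import sqrt

def support(items, sessions):
    """Fraction of sessions that contain every page in `items`."""
    items = frozenset(items)
    return sum(items <= s for s in sessions) / len(sessions)

def is_measure(x, mediator, sessions):
    """IS measure, used here as one possible Dep(x, M) (an assumption)."""
    joint = support({x} | set(mediator), sessions)
    denom = sqrt(support({x}, sessions) * support(mediator, sessions))
    return joint / denom if denom else 0.0

def indirectly_associated(a, b, mediator, sessions, t_s, t_f, t_d):
    """Check the itempair, mediator-support and dependence conditions."""
    m = frozenset(mediator)
    return (
        support({a, b}, sessions) < t_s            # a and b rarely co-occur
        and support({a} | m, sessions) >= t_f      # a is frequent with M
        and support({b} | m, sessions) >= t_f      # b is frequent with M
        and is_measure(a, m, sessions) >= t_d      # a depends on M
        and is_measure(b, m, sessions) >= t_d      # b depends on M
    )

# Hypothetical sessions: two content pages reached through the same hub.
sessions = (
    [{"/home", "/grad"}] * 4
    + [{"/home", "/minor"}] * 4
    + [{"/home", "/grad", "/minor"}]
    + [{"/contact"}]
)

via_home = indirectly_associated(
    "/grad", "/minor", {"/home"}, sessions, t_s=0.2, t_f=0.3, t_d=0.5
)
via_contact = indirectly_associated(
    "/grad", "/minor", {"/contact"}, sessions, t_s=0.2, t_f=0.3, t_d=0.5
)
```

In this toy data /grad and /minor co-occur in only one session out of ten, yet each appears with /home in half of the sessions, so the pair qualifies via the mediator {/home} but not via {/contact}.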

14
Finding Interesting Negative Associations
For all pairs of items, split by the minimum itempair
support (Frequent vs. Infrequent) and by the mediator
thresholds (With Mediator vs. No Mediator):

              With Mediator   No Mediator
  Frequent    FM              FN
  Infrequent  IM              IN

If IM/FM ≈ IN/FN, then Indirect Association is not
surprising
15
Finding Interesting Negative Associations
              With Mediator   No Mediator
  Frequent    FM              FN
  Infrequent  IM              IN

  • IM/FM is small
  • IM/IN is small
  • ⇒ Indirect Association is interesting
16
Finding Interesting Negative Associations
Indirect Association is interesting when the minimum
itempair support threshold is small. But if the
threshold is too low, very few indirect
associations are obtained.
17
Application: Market-Basket Analysis
  • Substitute items: up-sell
    • Pavilion PC ↔ 17" Monitor ↔ Pavilion multimedia
      PC
  • Competing items: competitive analysis
    • Coke ↔ Ruffles ↔ Pepsi
  • Complementary items: cross-sell
    • Tekken 3 ↔ Playstation Memory Card ↔ Tomb Raider 2

18
Other Applications: Information Retrieval
  • Identify synonyms and antonyms
  • Identify the different contexts of a queried word
  • Useful to group together query results

[Figure: example with the words Union, Trade, Soviet, Worker]
19
Other Applications: Stock Market
Indirect Association can partition events that
are associated with the movement of a stock price.
20
Sequential Indirect Association: Formulation
  • A pair of Web pages, a and b, forms a sequential
    indirect association via a mediator sequence, w,
    if the following conditions are satisfied
    • Support(a,b) < t_s
    • Support(s1) ≥ t_f, Support(s2) ≥ t_f
      (s1 = aw or wa, s2 = bw or wb)
    • Dependence(a,w) ≥ t_d, Dependence(b,w) ≥ t_d
    • w contains neither a nor b
  • Goal: discover groups of users who have different
    interests but share a common traversal path
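
The support conditions for the sequential case can be sketched in Python. This is only an illustration under stated assumptions: a sequence is counted as supported by a session when it occurs as an ordered (not necessarily contiguous) subsequence, Support(a,b) is taken as the fraction of sessions containing both pages in any order, only the case s1 = wa, s2 = wb is covered, and the dependence test is omitted. All names and data are hypothetical.

```python
def contains_subsequence(session, pattern):
    """True if `pattern` occurs in `session` in order, gaps allowed."""
    remaining = iter(session)
    # `page in remaining` advances the iterator, so order is enforced.
    return all(page in remaining for page in pattern)

def seq_support(pattern, sessions):
    """Fraction of sessions that contain `pattern` as a subsequence."""
    return sum(contains_subsequence(s, pattern) for s in sessions) / len(sessions)

def pair_support(a, b, sessions):
    """Fraction of sessions that contain both pages, in any order."""
    return sum(a in s and b in s for s in sessions) / len(sessions)

def common_prefix_candidate(a, b, w, sessions, t_s, t_f):
    """Support conditions for s1 = w followed by a, s2 = w followed by b."""
    if a in w or b in w:          # w must contain neither a nor b
        return False
    return (
        pair_support(a, b, sessions) < t_s
        and seq_support(list(w) + [a], sessions) >= t_f
        and seq_support(list(w) + [b], sessions) >= t_f
    )

# Hypothetical click-streams sharing the path /home -> /grad.
sessions = (
    [["/home", "/grad", "/grad/apply"]] * 3
    + [["/home", "/grad", "/grad/courses"]] * 3
    + [["/home", "/grad", "/grad/apply", "/grad/courses"]]
    + [["/contact"]]
)

candidate = common_prefix_candidate(
    "/grad/apply", "/grad/courses", ["/home", "/grad"],
    sessions, t_s=0.2, t_f=0.4,
)
```

A full check would add a dependence measure between each page and w, in the same way as in the non-sequential formulation.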

21
Types of Sequential Indirect Association
  • Type C (convergence): different entry points
  • Type D (divergence): different user interests
  • Type T (transitive): different entry and user
    interests
22
Clustering vs Indirect Association
  • Clustering is another way to discover different
    groups of users
    • A. Banerjee and J. Ghosh, Clickstream Clustering
      using Weighted Longest Common Subsequences,
      Workshop on Web Mining (2001)
    • TW Yan, M Jacobsen, H Garcia-Molina and U Dayal,
      From User Access Patterns to Dynamic Hypertext
      Linking, Proc of the 5th International World Wide
      Web Conference (1996)
    • YJ Fu, K Sandhu, MY Shih, A Generalization-Based
      Approach to Clustering of Web Usage Sessions, Web
      Usage Analysis and User Profiling, LNAI (2001)
  • Clustering cannot find distinct groups of users
    who share a similar traversal path, since the
    support of the mediator is large (the mediator
    often contains navigation pages)
  • It is more likely that several indirect
    associations are contained within a single
    cluster than that each indirect association
    connects two separate clusters

23
Indirect Association for Web Data
24
Impact of Site Structure on Associations
  • Navigation pages (e.g. home pages and hub pages)
    tend to have higher support than content pages

Source: U of Minnesota Computer Science
department Web logs (Jan 1-31, 2001)
  • If threshold is too high, most of the patterns
    contain only navigation pages
  • If threshold is too low, too many patterns!

25
How Indirect Association helps
  • Indirect association groups together patterns
    that have similar substructures
  • The common substructures (mediators) often
    contain the navigation pages

Mediator
Navigation pages
26
Grouping Indirect Associations
  • Indirect associations can also be grouped
    together into more compact structures if they
    share a common mediator
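
Grouping by a shared mediator amounts to indexing the discovered associations by their mediator set. A minimal Python sketch, assuming each indirect association is represented as a hypothetical (a, b, mediator) triple:

```python
from collections import defaultdict

def group_by_mediator(associations):
    """Map each mediator set to the indirect pairs that share it."""
    groups = defaultdict(set)
    for a, b, mediator in associations:
        # frozenset makes the mediator hashable and the pair unordered.
        groups[frozenset(mediator)].add(frozenset((a, b)))
    return dict(groups)

# Hypothetical indirect associations (a, b, mediator).
associations = [
    ("/grad/apply", "/minor", {"/home", "/academics"}),
    ("/grad/apply", "/phd-exam", {"/home", "/academics"}),
    ("/minor", "/phd-exam", {"/academics", "/home"}),
    ("/jobs", "/seminars", {"/home"}),
]
groups = group_by_mediator(associations)
```

Each key then stands for one compact structure: a mediator together with all the infrequent pairs it connects.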

Check degree of association
27
Viewer for Non-Sequential Indirect Association
Indirect Pairs (dashed line)
List of Mediators
Frequent Pairs (solid line)
Currently Selected Mediator
28
Experimental Setup
  • Data sources
    • UMN: University of Minnesota Web log
      • contains 91,443 page views and 34,526 sessions
    • EC: E-commerce Web log
      • contains 6,664 page views and 143,604 sessions
  • Steps
    • Preprocessing
      • identify sessions, convert sessions into
        transactions
    • Extract frequent itemsets or sequences
      • apply Apriori algorithm to find frequent
        itemsets
      • apply GSP algorithm to find frequent sequences
    • Apply indirect association algorithm
    • Merge indirect associations that have a common
      mediator
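
The preprocessing step can be sketched in Python. The talk does not say how sessions were identified; the sketch below assumes log records of the form (user, unix_time, page) and splits a user's visits whenever the gap exceeds a 30-minute inactivity timeout, which is a common heuristic and not necessarily the one used in these experiments. All names and records are hypothetical.

```python
from collections import defaultdict

def sessionize(records, timeout=1800):
    """Split each user's page views into sessions at gaps > timeout seconds."""
    by_user = defaultdict(list)
    for user, ts, page in records:
        by_user[user].append((ts, page))
    sessions = []
    for user in sorted(by_user):
        visits = sorted(by_user[user])          # order by timestamp
        current = [visits[0][1]]
        for (prev_ts, _), (ts, page) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:          # long gap: start a new session
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions

def to_transactions(sessions):
    """Drop order and repeats: one set of pages per session."""
    return [frozenset(s) for s in sessions]

# Hypothetical log records: (user, unix_time, page).
records = [
    ("u1", 0, "/home"),
    ("u1", 60, "/product"),
    ("u1", 4000, "/home"),
    ("u2", 10, "/home"),
    ("u2", 20, "/shipping"),
]
sessions = sessionize(records)
transactions = to_transactions(sessions)
```

The ordered `sessions` would feed a sequence miner such as GSP, while the unordered `transactions` would feed a frequent itemset miner such as Apriori.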

29
Experimental Results (UMN)
Students taking CS as minor subject
Contact information
Prospective graduate students
CS Graduate student association
Students planning to take PhD exam
30
Experimental Results (UMN)
31
Experimental Results (EC)
32
Experimental Results (EC)
33
Experimental Results (EC)
34
Experimental Results (EC)
35
Experimental Results (EC)
36
Experimental Results (EC)
37
Conclusion
  • Indirect Association provides an alternative
    approach to capture interesting infrequent
    patterns
  • For Web data, indirect association represents the
    distinct interests of Web users who share a
    similar traversal path
  • Such patterns cannot be easily found using
    standard association and clustering techniques
  • Indirect Association can be used to group
    together patterns into more compact structures
  • Navigation pages form the mediators

38
References
  • PN Tan, V Kumar, J Srivastava, Indirect
    Association: Mining Higher Order Dependencies in
    Data, In Proc of the 4th European Conference on
    Principles and Practice of Knowledge Discovery in
    Databases, Lyon, France, Sept 13-16, pp. 632-637
    (2000)
  • PN Tan, V Kumar, H Kuno, Using SAS for Mining
    Indirect Associations in Data, In Proc of the
    Western Users of SAS Software Conference (2001)
  • PN Tan, V Kumar, Mining Indirect Associations in
    Web Data, In Proc of WebKDD 2001: Mining Log Data
    Across All Customer TouchPoints, August (2001)
  • KDD Web Mining workshops: WebKDD 1999, WebKDD
    2000, WebKDD 2001
  • SIAM Workshop on Web Mining, 2001 and 2002