Transcript and Presenter's Notes

Title: Efficient Data Mining for Path Traversal Patterns


1
Efficient Data Mining for Path Traversal Patterns
  • CS401 Paper Presentation
  • Chaoqiang chen
  • Guang Xu

2
Overview
  • Introduction
  • Problem Formulation
  • Algorithm for Traversal Pattern
  • 1. Identifying Maximal Forward
    References
  • 2. Determining Large Reference Sequences
  • Performance Results
  • 1. Generation of Synthetic Traversal Paths
  • 2. Performance Comparison Between FS and SS
  • 3. Sensitivity Analysis
  • Advantages and Disadvantages
  • Conclusions

3
Introduction
  • Analysis of past transaction data can provide very valuable information on customer buying behavior and thus improve the quality of business decisions.
  • It is essential to collect a sufficient amount of sales data before any meaningful conclusion can be drawn.
  • Efficient algorithms are needed to conduct mining on such huge volumes of data.
  • One of the most important data mining problems is mining association rules, where the presence of some items in a transaction implies the presence of other items in the same transaction.

4
Introduction
  • Mining access patterns in a distributed
    information providing environment where objects
    are linked together to facilitate interactive
    access.
  • 1. Improve the system design provide efficient
    access between highly correlated objects etc.
  • 2. Better marketing decisions putting
    advertisements in proper places etc.
  • Algorithms for mining traversal patterns
  • 1. Algorithm MF (standing for maximal forward
    references) to convert the original sequence of
    log data into a set of maximal forward
    references.
  • 2. Algorithms FS (full-scan) and SS
    (selective-scan) for determining large reference
    sequences.

5
Problem Formulation
  • Traversal path for a user: A, B, C, D, C, B, E, G, H, G, W, A, O, U, O, V
  • Maximal forward references: ABCD, ABEGH, ABEGW, AOU, AOV

6
Problem Formulation
  • Some nodes might be revisited because of their location rather than their content.
  • Assume that a backward reference is made mainly for ease of traveling, not for browsing.
  • When a backward reference occurs, the forward reference path terminates; the resulting forward reference path is termed a maximal forward reference.
  • A large reference sequence is a reference sequence that appears a sufficient number of times in the set of maximal forward references.
  • The number of times a reference sequence has to appear in order to qualify as a large reference sequence is called the minimal support.

7
Problem Formulation
  • A large k-reference is a large reference sequence with k elements. The set of large k-references is denoted Lk, and its candidate set Ck.
  • A maximal reference sequence is a large reference sequence that is not contained in any other maximal reference sequence.
  • A maximal reference sequence corresponds to a hot access pattern in an information-providing service.
  • Example: suppose {AB, BE, AD, CG, GH, BG} is the set of large 2-references (i.e., L2) and {ABE, CGH} is the set of large 3-references (i.e., L3); then the resulting maximal reference sequences are {AD, BG, ABE, CGH} (a sketch of this extraction follows this list).
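As a minimal illustration of this definition (not code from the paper), the following Python sketch keeps only those large reference sequences that are not contained as a consecutive subsequence in any longer large reference sequence; the function names and the encoding of sequences as strings are assumptions made for the example.

```python
def maximal_reference_sequences(large_sequences):
    """Keep only large reference sequences not contained (as a consecutive
    subsequence) in any other large reference sequence."""
    def contained_in(short, long_):
        # True if `short` occurs as a consecutive run inside `long_`
        return len(short) < len(long_) and any(
            long_[i:i + len(short)] == short
            for i in range(len(long_) - len(short) + 1))

    return [s for s in large_sequences
            if not any(contained_in(s, other) for other in large_sequences)]

# Example from this slide: L2 = {AB, BE, AD, CG, GH, BG}, L3 = {ABE, CGH}
large = ["AB", "BE", "AD", "CG", "GH", "BG", "ABE", "CGH"]
print(maximal_reference_sequences(large))  # ['AD', 'BG', 'ABE', 'CGH']
```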

8
Problem Formulation
  • Procedure for mining traversal patterns:
  • Step 1: Determine maximal forward references from the original log data.
  • Step 2: Determine large reference sequences (i.e., Lk, k > 1) from the set of maximal forward references.
  • Step 3: Determine maximal reference sequences from the large reference sequences.
  • Extraction of maximal reference sequences from large reference sequences (i.e., Step 3) is straightforward.
  • The focus is on Steps 1 and 2.

9
Problem Formulation
[Figure. Legend: D = set of maximal forward references; Ci = candidate set of large reference sequences; Li = set of large reference sequences; minimum support = 2.]
10
Algorithm for Traversal Pattern --- Identifying Maximal Forward References
  • A traversal log database contains, for each link
    traversed, a pair of (source, destination)
  • The traversal log database is sorted by user IDs, resulting in a traversal path (s1, d1), (s2, d2), ..., (sn, dn) for each user, where the pairs (si, di) are ordered by time.
  • Then algorithm MF is applied to determine the
    maximal forward references.

11
Algorithm for Traversal Pattern --- Identifying Maximal Forward References
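The body of this slide is not captured in the transcript; as a stand-in, here is a minimal Python sketch of the MF step described above (not the paper's pseudocode). It assumes a user's traversal path has already been assembled from the time-ordered (source, destination) pairs, and it treats any revisit of a node already on the current path as a backward reference.

```python
def maximal_forward_references(path):
    """Split one user's traversal path into maximal forward references,
    treating a revisit of a node on the current path as a backward move."""
    mfrs = []            # collected maximal forward references
    stack = []           # current forward reference path
    moving_forward = True
    for node in path:
        if node in stack:                          # backward reference
            if moving_forward:                     # a forward path just ended
                mfrs.append("".join(stack))
            stack = stack[:stack.index(node) + 1]  # pop back to the revisited node
            moving_forward = False
        else:                                      # forward reference
            stack.append(node)
            moving_forward = True
    if moving_forward and stack:                   # flush the final forward path
        mfrs.append("".join(stack))
    return mfrs

# Worked example from slide 5:
print(maximal_forward_references(list("ABCDCBEGHGWAOUOV")))
# ['ABCD', 'ABEGH', 'ABEGW', 'AOU', 'AOV']
```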
12
Algorithm for Traversal Pattern --- Determining Large Reference Sequences (FS)
  • Algorithm FS utilizes the key ideas of the DHP technique (i.e., hashing and pruning).
  • DHP: efficient generation of large itemsets and effective reduction of the transaction database size after each scan.
  • Ck can be generated by joining Lk-1 with itself.
  • Different from DHP, in FS, for any two distinct reference sequences in Lk-1, say r1, ..., rk-1 and s1, ..., sk-1, we join them to form a k-reference sequence only if either r1, ..., rk-1 contains s1, ..., sk-2 or s1, ..., sk-1 contains r1, ..., rk-2 (a sketch of this join follows this list).
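As a sketch of this candidate-generation step (one natural reading of the join for contiguous reference sequences, not necessarily the paper's exact formulation): two large (k-1)-references are joined into a k-reference candidate when the tail of one matches the head of the other, so that both contiguous (k-1)-subsequences of the candidate are already large. Sequences are encoded as strings purely for illustration.

```python
def generate_candidates(large_prev):
    """Ck from Lk-1 join Lk-1: join r = r1..rk-1 and s = s1..sk-1 into
    r1..rk-1 + sk-1 when r2..rk-1 == s1..sk-2 (tail of r == head of s)."""
    prev = set(large_prev)
    candidates = set()
    for r in prev:
        for s in prev:
            if r != s and r[1:] == s[:-1]:  # r's suffix overlaps s's prefix
                candidates.add(r + s[-1])
    return candidates

# Example: candidate 3-references from the large 2-references of slide 7
print(sorted(generate_candidates(["AB", "BE", "AD", "CG", "GH", "BG"])))
# ['ABE', 'ABG', 'BGH', 'CGH'] -- these are counted against the minimum support next
```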

13
Algorithm for Traversal Pattern --- Determining Large Reference Sequences (FS)
[Figure. Legend: D = set of maximal forward references; Ci = candidate set of large reference sequences; Li = set of large reference sequences; minimum support = 2.]
14
Algorithm for Traversal Pattern --- Determining Large Reference Sequences (SS)
  • SS utilizes the information in the candidate references of prior passes to avoid database scans in some passes, thus further reducing the disk I/O cost.
  • By generating C3 from C2 * C2 instead of from L2 * L2, and storing both C2 and C3 in main memory, L2 and L3 can be found together when the next scan of the database is performed (a sketch follows this list).
  • If |Ck+1| > |Ck| for some k > 2, it is usually beneficial to have a database scan to obtain Lk+1 before the set of candidate references becomes too big.
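A minimal Python sketch of the scan-saving idea, assuming C2 and C3 both fit in main memory so that one pass over the maximal forward references counts support for both levels at once. The example candidate sets and the counting of contiguous-subsequence occurrences are illustrative assumptions, not details taken from the slides.

```python
from collections import Counter

def count_two_levels(mfr_database, c2, c3):
    """One database scan that counts support for both C2 and C3."""
    counts = Counter()
    candidates = set(c2) | set(c3)
    for mfr in mfr_database:
        for k in (2, 3):
            for i in range(len(mfr) - k + 1):
                sub = mfr[i:i + k]           # contiguous k-reference in this MFR
                if sub in candidates:
                    counts[sub] += 1
    return counts

# MFRs from slide 5, with hypothetical candidate sets and minimum support 2
db = ["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"]
counts = count_two_levels(db, c2=["AB", "AO", "BE", "EG"], c3=["ABE", "AOU"])
print({seq: n for seq, n in counts.items() if n >= 2})  # the surviving large references
```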

15
Performance Results --- Generation of Synthetic
Traversal Paths
16
Performance Results --- Generation of Synthetic
Traversal Paths
  • The browsing scenario in a World Wide Web (WWW) environment is simulated. A traversal tree is constructed to mimic the WWW structure; the starting position is the root node of the tree.
  • The traversal tree consists of internal nodes and leaf nodes.
  • The number of child nodes at each internal node, referred to as the fanout, is determined from a uniform distribution within a given range.
  • The height of a subtree whose subroot is a child node of the root node is determined from a Poisson distribution.
  • A traversal path consists of the nodes accessed by a user. The size of each traversal path is picked from a Poisson distribution.

17
Performance Results --- Generation of Synthetic
Traversal Paths
  • With the first node being the root node, a traversal path is generated probabilistically within the traversal tree (a sketch of the generator follows this list).
  • 1. For internal nodes:
  • p0: probability of going back to the parent node
  • p1, p2, p3, p4: probabilities of going to the child nodes
  • pj: probability of jumping to another internal node
  • 2. For a leaf node: 25% probability of going back to the parent, 75% of jumping to an internal node.
  • The number of internal nodes with internal jumps is denoted by NJ, which is set to 3% of all the internal nodes in general cases.
  • The sensitivity of varying NJ will be analyzed.
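A hedged Python sketch of this generator, simplified from the description above: it uses a uniform choice among child nodes instead of the per-child probabilities p1..p4, and the tree, jump-node set, path length, and node names (including "root") are supplied by the caller; none of the numeric defaults below are taken from the paper.

```python
import random

def synthetic_path(children, parent, jump_nodes, length, p0=0.25, pj=0.1):
    """Random walk over a traversal tree: at an internal node go back to the
    parent with probability p0, jump to another internal node with probability
    pj (only at designated jump nodes), otherwise descend to a child; at a
    leaf, return to the parent 25% of the time and jump 75% of the time."""
    internal = [n for n, kids in children.items() if kids]
    node, path = "root", ["root"]
    while len(path) < length:
        if children[node]:                              # internal node
            r = random.random()
            if r < p0 and node in parent:               # backward to parent
                node = parent[node]
            elif node in jump_nodes and r < p0 + pj:    # internal jump
                node = random.choice(internal)
            else:                                       # forward to a child
                node = random.choice(children[node])
        else:                                           # leaf node
            node = parent[node] if random.random() < 0.25 else random.choice(internal)
        path.append(node)
    return path

# Tiny example tree: root -> {a, b}, a -> {a1, a2}, b -> {b1}
children = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"],
            "a1": [], "a2": [], "b1": []}
parent = {"a": "root", "b": "root", "a1": "a", "a2": "a", "b1": "b"}
print(synthetic_path(children, parent, jump_nodes={"a"}, length=8))
```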

18
Performance Results --- Generation of Synthetic
Traversal Paths
19
Performance Results --- Performance Comparison
between FS and SS
  • D = 200,000, NJ = 3%, and pj = 0.1.
  • The fanout at each internal node is between 4 and 7.
  • The root node has 7 child nodes.
  • The number of internal nodes is 16,200, and the number of leaf nodes is 73,006.
  • Algorithm SS in general outperforms FS, and their performance difference becomes prominent when the I/O cost is taken into account.

20
Performance Results --- Performance Comparison
between FS and SS
21
Performance Results --- Performance Comparison
between FS and SS
22
Performance Results --- Performance Comparison
between FS and SS
  • Algorithm SS consistently outperforms FS as the
    database size increases

23
Performance Results --- Sensitivity Analysis
  • D = 200,000, P = 10, and the minimum support is 0.75%.
  • As the probability of going backward at an internal node, p0, increases, the number of large reference sequences decreases because the possibility of forward traveling becomes smaller.

24
Performance Results --- Sensitivity Analysis
  • The number of large reference sequences decreases as the number of child nodes of internal nodes (the fanout) increases.
  • This is because, with a larger fanout, the traversal paths are more likely to be dispersed over several branches, resulting in fewer large reference sequences.

25
Performance Results --- Sensitivity Analysis
  • The probability of traveling to each child node from an internal node is determined from a Zipf-like distribution (a sketch follows this list).
  • The number of large reference sequences increases when the corresponding probabilities are more skewed.
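A tiny sketch of one common form of Zipf-like probabilities, where the i-th child gets a weight proportional to 1/i^theta; the skew parameter theta is an assumption here, not a value from the paper (larger theta means more skew).

```python
def zipf_like_probabilities(num_children, theta=1.0):
    """Assign child-access probabilities p_i proportional to 1 / i**theta."""
    weights = [1.0 / (i ** theta) for i in range(1, num_children + 1)]
    total = sum(weights)
    return [w / total for w in weights]

print(zipf_like_probabilities(4))  # roughly [0.48, 0.24, 0.16, 0.12]
```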

26
Performance Results --- Sensitivity Analysis
27
Advantages and Disadvantages
  • Advantages
  • 1. By making use of the DHP technique (i.e., hashing and pruning), the proposed algorithms are very efficient in generating the sets of large reference sequences Lk and in reducing the size of the database of maximal forward references after each scan. In this way, both CPU and I/O costs are reduced.
  • 2. By utilizing the information in candidate references from prior passes, algorithm SS avoids database scans in some passes, thus further reducing the disk I/O cost.

28
Advantages and Disadvantages
  • Disadvantages
  • 1. Since a user's backward browsing is treated as being only for ease of traveling, there is a possibility that some information about the user's browsing behavior is lost. It would be better if backward traversal patterns could also be mined to reflect the user's real behavior.
  • 2. When a traversal log record contains only the destination reference instead of a (source, destination) pair, the MF algorithm cannot identify the breakpoint where the user picks a new URL to begin a new traversal path. This can increase the computational complexity because the paths considered become longer, especially in an environment where users jump from site to site frequently. Thus some complementary method should be employed to deal with this problem.

29
Conclusions
  • A new data mining capability is explored. It
    involves mining traversal patterns in an
    information providing environment where documents
    or objects are linked together to facilitate
    interactive access.
  • Algorithm MF is employed to convert the original
    sequence of log data into a set of maximal
    forward references.
  • Algorithms FS and SS are developed to determine
    large reference sequences from the maximal
    forward references obtained.
  • FS is based on some hashing and pruning
    techniques, and SS is a further improvement of
    FS.
  • Performance of FS and SS has been comparatively
    analyzed. Algorithm SS in general outperforms
    algorithm FS. Sensitivity analysis on various
    parameters was also conducted.