Title: Efficient Data Mining for Path Traversal Patterns
1Efficient Data Mining for Path Traversal Patterns
- CS401 Paper Presentation
- Chaoqiang Chen
- Guang Xu
2Overview
- Introduction
- Problem Formulation
- Algorithm for Traversal Pattern
- 1. Identifying Maximal Forward References
- 2. Determining Large Reference Sequences
- Performance Results
- 1. Generation of Synthetic Traversal Paths
- 2. Performance Comparison Between FS and SS
- 3. Sensitivity Analysis
- Advantages and Disadvantages
- Conclusions
3Introduction
- Analysis of past transaction data can provide
very valuable information on customer buying
behavior, and thus improve the quality of
business decisions.
- It is essential to collect a sufficient amount of
sales data before any meaningful conclusion can
be drawn.
- Efficient algorithms are needed to mine
such large volumes of data.
- One of the most important data mining problems is
mining association rules: the presence of some
items in a transaction implies the presence of
other items in the same transaction.
4Introduction
- Mining access patterns in a distributed
information providing environment where objects
are linked together to facilitate interactive
access helps to:
- 1. Improve the system design (provide efficient
access between highly correlated objects, etc.)
- 2. Make better marketing decisions (put
advertisements in proper places, etc.)
- Algorithms for mining traversal patterns
- 1. Algorithm MF (standing for maximal forward
references) to convert the original sequence of
log data into a set of maximal forward
references.
- 2. Algorithms FS (full-scan) and SS
(selective-scan) for determining large reference
sequences.
5Problem Formulation
- Traversal path for a user: A, B, C, D, C, B, E, G,
H, G, W, A, O, U, O, V
- Maximal forward references: ABCD, ABEGH, ABEGW,
AOU, AOV
6Problem Formulation
- Some nodes might be revisited because of their
location, rather than their content.
- Assume that a backward reference is mainly for
ease of traveling, not for browsing.
- When a backward reference occurs, the forward
reference path terminates; the resulting forward
reference path is termed a maximal forward
reference.
- A large reference sequence is a reference
sequence that appears a sufficient number of
times in a set of maximal forward references.
- The number of times a reference sequence has to
appear in order to qualify as a large
reference sequence is called the minimal support.
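A minimal sketch of how support could be counted. It assumes each maximal forward reference is a string of single-letter node names, and that a reference sequence counts once for each maximal forward reference in which it appears as a consecutive run of nodes. The function names are illustrative; the sample database reuses the maximal forward references from the traversal-path example.

```python
def contains(path, seq):
    """True if seq appears in path as a consecutive run of nodes."""
    n = len(seq)
    return any(path[i:i + n] == seq for i in range(len(path) - n + 1))


def support(database, seq):
    """Number of maximal forward references that contain seq."""
    return sum(1 for path in database if contains(path, seq))


def is_large(database, seq, min_support):
    """A reference sequence is large if its support reaches the minimum."""
    return support(database, seq) >= min_support


database = ["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"]
print(support(database, "AB"))      # 3
print(is_large(database, "BC", 2))  # False
```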
7Problem Formulation
- A large k-reference is a large reference sequence
with k elements. The set of large k-references is
denoted Lk, and its candidate set Ck.
- A maximal reference sequence is a large
reference sequence that is not contained in any
other maximal reference sequence.
- A maximal reference sequence corresponds to a
hot access pattern in an information providing
service.
- Suppose that AB, BE, AD, CG, GH, BG is the set
of large 2-references (i.e., L2) and ABE, CGH is
the set of large 3-references (i.e., L3). Then
the resulting maximal reference sequences are
AD, BG, ABE, CGH.
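The example above can be checked with a short sketch. It assumes "contained" means appearing as a consecutive run inside a longer sequence, and it ignores large 1-references, as the slide example does. The function names are illustrative.

```python
def contains(longer, shorter):
    """True if shorter appears in longer as a consecutive run."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter
               for i in range(len(longer) - n + 1))


def maximal_reference_sequences(large_sequences):
    """Keep the large sequences not contained in any other large sequence.

    Testing against all large sequences is equivalent to testing against
    the maximal ones, since every large sequence lies inside a maximal one.
    """
    large = set(large_sequences)
    keep = [s for s in large
            if not any(s != t and contains(t, s) for t in large)]
    return sorted(keep, key=lambda s: (len(s), s))


large = ["AB", "BE", "AD", "CG", "GH", "BG", "ABE", "CGH"]
print(maximal_reference_sequences(large))  # ['AD', 'BG', 'ABE', 'CGH']
```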
8Problem Formulation
- Procedure for mining traversal patterns
- Step 1: Determine maximal forward references
from the original log data.
- Step 2: Determine large reference sequences
(i.e., Lk, k ≥ 1) from the set of maximal
forward references.
- Step 3: Determine maximal reference sequences
from large reference sequences.
- Extraction of maximal reference sequences from
large reference sequences (i.e., Step 3) is
straightforward.
- Focus on Steps 1 and 2.
9Problem Formulation
- D: set of maximal forward references
- Ci: candidate set of large reference sequences
- Li: set of large reference sequences
- Support = 2
10Algorithm for Traversal Pattern ---Identifying
Maximal Forward References
- A traversal log database contains, for each link
traversed, a pair of (source, destination).
- The traversal log database is sorted by user
ID, resulting in a traversal path,
(s1,d1), (s2,d2), ..., (sn,dn), for each user,
where the pairs (si,di) are ordered by time.
- Then algorithm MF is applied to determine the
maximal forward references.
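A minimal sketch of the idea behind algorithm MF, not the paper's exact pseudocode. As a simplification it works on one user's sequence of visited nodes rather than on (source, destination) pairs, and it treats any revisit of a node already on the current path as a backward reference. The function and variable names are illustrative.

```python
def maximal_forward_references(path):
    """Split one user's node sequence into maximal forward references."""
    current = []            # nodes on the current forward path
    moving_forward = False  # True if the last move was a forward reference
    result = []
    for node in path:
        if node in current:
            # Backward reference: emit the forward path once, then
            # back up to the revisited node.
            if moving_forward:
                result.append("".join(current))
            current = current[:current.index(node) + 1]
            moving_forward = False
        else:
            current.append(node)
            moving_forward = True
    if moving_forward:
        result.append("".join(current))
    return result


path = "ABCDCBEGHGWAOUOV"
print(maximal_forward_references(path))
# ['ABCD', 'ABEGH', 'ABEGW', 'AOU', 'AOV']
```

Applied to the traversal path from the Problem Formulation slide, this reproduces the five maximal forward references listed there.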
12Algorithm for Traversal Pattern --- Determining
large Reference Sequences (FS)
- Algorithm FS utilizes key ideas of the DHP (i.e.,
hashing and pruning) technique.
- DHP: efficient generation of large itemsets and
effective reduction of the transaction database size
after each scan.
- Ck can be generated by joining Lk-1 with itself.
- Different from DHP, in FS, for any two distinct
reference sequences in Lk-1, say r1, ..., rk-1 and
s1, ..., sk-1, we join them to form a k-reference
sequence only if either r1, ..., rk-1 contains
s1, ..., sk-2 or s1, ..., sk-1 contains r1, ..., rk-2.
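A minimal sketch of the join step. The join condition is read here as an overlap test: two (k-1)-reference sequences are joined when the last k-2 elements of one equal the first k-2 elements of the other. That reading is an interpretation of the condition above, and the function name is illustrative. The hashing and pruning parts of FS are not shown.

```python
def join_candidates(large_prev):
    """Build candidate k-references from the set of large (k-1)-references.

    r and s are joined when r without its first element equals
    s without its last element; the candidate is r followed by
    the last element of s.
    """
    candidates = set()
    for r in large_prev:
        for s in large_prev:
            if r != s and r[1:] == s[:-1]:
                candidates.add(r + s[-1])
    return candidates


l2 = {"AB", "BE", "AD", "CG", "GH", "BG"}
print(sorted(join_candidates(l2)))  # ['ABE', 'ABG', 'BGH', 'CGH']
```

Because order matters in a reference sequence, AB and BE give ABE but not BEA, unlike an itemset join.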
13Algorithm for Traversal Pattern --- Determining
large Reference Sequences (FS)
- D: set of maximal forward references
- Ci: candidate set of large reference sequences
- Li: set of large reference sequences
- Support = 2
14Algorithm for Traversal Pattern --- Determining
large Reference Sequences (SS)
- Utilizes the information in candidate references
from prior passes to avoid database scans in some
passes, thus further reducing the disk I/O cost.
- Generate C3 from C2 * C2 instead of from L2 * L2.
If both C2 and C3 can be stored in main
memory, we can find L2 and L3 together when the
next scan of the database is performed.
- If |Ck+1| > |Ck| for some k ≥ 2, it is
usually beneficial to have a database scan to
obtain Lk+1 before the set of candidate
references becomes too big.
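A minimal sketch of one selective-scan step: C3 is built from C2 before L2 is known, and a single pass over the database then yields both L2 and L3. It assumes string-valued reference sequences and the same overlap join as FS. The function names are illustrative, and hashing, pruning, and database trimming are omitted.

```python
from collections import Counter


def join(seqs):
    """Overlap join: r and s combine when r[1:] equals s[:-1]."""
    return {r + s[-1] for r in seqs for s in seqs
            if r != s and r[1:] == s[:-1]}


def count_in_one_scan(database, candidates):
    """Single pass over the database, counting candidates of any length."""
    lengths = {len(c) for c in candidates}
    counts = Counter()
    for path in database:
        seen = set()
        for k in lengths:
            for i in range(len(path) - k + 1):
                window = path[i:i + k]
                if window in candidates:
                    seen.add(window)
        counts.update(seen)  # each path contributes at most once
    return counts


def selective_scan_step(database, c2, min_support):
    c3 = join(c2)  # built from C2 * C2, not from L2 * L2
    counts = count_in_one_scan(database, c2 | c3)
    l2 = {c for c in c2 if counts[c] >= min_support}
    l3 = {c for c in c3 if counts[c] >= min_support}
    return l2, l3


database = ["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"]
c2 = {"AB", "BC", "CD", "BE", "EG", "GH", "GW", "AO", "OU", "OV"}
l2, l3 = selective_scan_step(database, c2, min_support=2)
print(sorted(l2), sorted(l3))
```

The trade-off is that C3 built this way is larger than one built from L2, which is why a scan is recommended once the candidate sets start growing.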
16Performance Results --- Generation of Synthetic
Traversal Paths
- The browsing scenario in a World Wide Web (WWW)
environment is simulated. A traversal tree is
constructed to mimic the WWW structure, with the
root node of the tree as the starting position.
- The traversal tree consists of internal nodes and
leaf nodes.
- The number of child nodes at each internal node,
referred to as fanout, is determined from a
uniform distribution within a given range.
- The height of a subtree whose subroot is a child
node of the root node is determined from a
Poisson distribution.
- A traversal path consists of nodes accessed by a
user. The size of each traversal path is picked
from a Poisson distribution.
17Performance Results --- Generation of Synthetic
Traversal Paths
- With the first node being the root node, a
traversal path is generated probabilistically
within the traversal tree.
- 1. For internal nodes
- p0: probability of going back to the parent node
- p1, p2, p3, p4: probabilities of going to the
child nodes
- pj: probability of jumping to another internal
node
- 2. For leaf nodes: 25% go to the parent, 75% jump
to an internal node
- The number of internal nodes with internal jumps
is denoted by NJ, which is set to 3% of all the
internal nodes in general cases.
- Sensitivity to varying NJ will be analyzed.
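A simplified sketch of the path generator described above, under several assumptions that are not from the slides: the tree has a fixed height instead of Poisson-distributed subtree heights, the path size is passed in rather than drawn from a Poisson distribution, children are chosen uniformly, and every internal node may jump (the slides restrict jumps to NJ of the internal nodes). All names and default values are illustrative.

```python
import random


def build_tree(rng, fanout_range=(4, 7), height=3):
    """Return (children, parent) dicts; node 0 is the root. Use height >= 2."""
    children, parent = {0: []}, {0: None}
    frontier, next_id = [0], 1
    for _ in range(height):
        new_frontier = []
        for node in frontier:
            for _ in range(rng.randint(*fanout_range)):  # uniform fanout
                children[node].append(next_id)
                children[next_id] = []
                parent[next_id] = node
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    return children, parent


def generate_path(rng, children, parent, size, p0=0.1, pj=0.1):
    """Random walk of `size` nodes starting at the root."""
    internal = [n for n, kids in children.items() if kids]
    path, node = [0], 0
    while len(path) < size:
        kids = children[node]
        u = rng.random()
        if not kids:
            # Leaf: 25% back to the parent, 75% jump to an internal node.
            node = parent[node] if u < 0.25 else rng.choice(internal)
        elif parent[node] is not None and u < p0:
            node = parent[node]  # backward reference
        elif u >= 1.0 - pj:
            node = rng.choice([n for n in internal if n != node])  # jump
        else:
            node = rng.choice(kids)  # forward to a child
        path.append(node)
    return path


rng = random.Random(0)
children, parent = build_tree(rng)
path = generate_path(rng, children, parent, size=20)
print(path)
```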
19Performance Results --- Performance Comparison
between FS and SS
- |D| = 200,000, NJ = 3%, and pj = 0.1
- The fanout at each internal node is between 4 and
7.
- The root node has 7 child nodes.
- The number of internal nodes is 16,200; the
number of leaf nodes is 73,006.
- Algorithm SS in general outperforms FS, and their
performance difference becomes prominent when the
I/O cost is taken into account.
22Performance Results --- Performance Comparison
between FS and SS
- Algorithm SS consistently outperforms FS as the
database size increases
23Performance Results --- Sensitivity Analysis
- |D| = 200,000, |P| = 10, the minimum support is
0.75%.
- As the probability of going backward at an
internal node, p0, increases, the number of large
reference sequences decreases because the
possibility of forward traveling becomes
smaller.
24Performance Results --- Sensitivity Analysis
- The number of large reference sequences decreases
as the number of child nodes of internal nodes
(fanout) increases.
- This is because with a larger fanout the traversal
paths are more likely to be dispersed over several
branches, resulting in fewer large reference
sequences.
25Performance Results --- Sensitivity Analysis
- The probability of traveling to each child node
from an internal node is determined from a
Zipf-like distribution.
- The number of large reference sequences increases
when the corresponding probabilities are more
skewed.
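The slides do not give the exact form of the Zipf-like distribution. As an illustration only, the sketch below assumes the i-th child gets a probability proportional to 1 / i^theta, where theta is a hypothetical skew parameter (theta = 0 is uniform, larger values are more skewed).

```python
def zipf_like(n, theta):
    """Probabilities proportional to 1 / i**theta for children i = 1..n."""
    weights = [1.0 / (i ** theta) for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


print(zipf_like(4, 0.0))  # uniform: [0.25, 0.25, 0.25, 0.25]
print(zipf_like(4, 1.0))  # skewed toward the first child
```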
27Advantages and Disadvantages
- Advantages
- 1. By making use of the DHP (i.e., hashing
and pruning) technique, the proposed
algorithms are very efficient in generating the
sets of large reference sequences Lk and in
reducing the size of the database of maximal
forward references after each scan. In this way,
both CPU and I/O costs are reduced.
- 2. By utilizing the information in candidate
references from prior passes, algorithm SS avoids
database scans in some passes, thus further
reducing the disk I/O cost.
28Advantages and Disadvantages
- Disadvantages
- 1. Since users' backward browsing is treated as
being for ease of traveling, there is a possibility
that some information about the users' browsing
behavior is lost. It would be better if backward
traversal patterns could also be mined to reflect
the users' real behavior.
- 2. When a traversal log record contains only the
destination reference instead of a pair of
references, the MF algorithm cannot identify the
breakpoint where the user picks a new URL to
begin a new traversal path. This could increase
the computational complexity because the paths
considered become longer, especially in an
environment where users jump from site to site
frequently. Thus some complementary method should
be employed to deal with this problem.
29Conclusions
- A new data mining capability is explored. It
involves mining traversal patterns in an
information providing environment where documents
or objects are linked together to facilitate
interactive access.
- Algorithm MF is employed to convert the original
sequence of log data into a set of maximal
forward references.
- Algorithms FS and SS are developed to determine
large reference sequences from the maximal
forward references obtained.
- FS is based on some hashing and pruning
techniques, and SS is a further improvement of
FS.
- Performance of FS and SS has been comparatively
analyzed. Algorithm SS in general outperforms
algorithm FS. Sensitivity analysis on various
parameters was also conducted.