Weblog Cleaning for Constructing Sequential Classifiers

Description:

9/2/09. Data Cleaning Workshop.

Transcript and Presenter's Notes

1
Web-log Cleaning for Constructing Sequential Classifiers
  • Qiang Yang
  • Hong Kong University of Science and Technology
  • Hong Kong
  • T.Y. Li and Ke Wang
  • Simon Fraser University, Canada

2
Web Usage Mining
  • uplherc.upl.com -- [01/Aug/1995:00:08:52 -0400] "GET /shuttle/resources/orbiters/endeavour/index.html HTTP/1.0" 200 5052
  • pm9.j51.com -- [01/Aug/1995:00:08:52 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 200 669
  • 139.230.35.135 -- [01/Aug/1995:00:08:52 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786

[Figure: a user session, i.e. the pages requested from one IP address: A.html, B.html, C.html, D.html]
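A minimal sketch of how log lines like those on this slide could be turned into per-visitor page sequences. It assumes the timestamp is bracketed as in the Common Log Format; `parse_line` and `group_by_host` are hypothetical helper names, and grouping purely by host (with no time-out between visits) is a simplifying assumption, not the authors' session definition.

```python
import re
from collections import defaultdict

# host, anything up to the bracketed timestamp, quoted request, status, size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) .*?\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)$'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it is malformed."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None

def group_by_host(lines):
    """Collect the requested URLs of each host in log order (assumed session rule)."""
    sessions = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record is not None:          # malformed lines are skipped (cleaning step)
            sessions[record["host"]].append(record["url"])
    return dict(sessions)

sample = [
    'uplherc.upl.com -- [01/Aug/1995:00:08:52 -0400] '
    '"GET /shuttle/resources/orbiters/endeavour/index.html HTTP/1.0" 200 5052',
    'pm9.j51.com -- [01/Aug/1995:00:08:52 -0400] '
    '"GET /images/WORLD-logosmall.gif HTTP/1.0" 200 669',
    'not a log line',
]
sessions = group_by_host(sample)
```

A real cleaning pass would usually also split one host's requests into several sessions and filter out image requests; both are left out here.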
3
Web Logs are Available
  • EPA (24 hours, Aug 29, 95): 47,748 requests for 3770 pages
  • NASA log (Kennedy Space Center, 1 month, 95): 1,569,898 visits on 15,429 pages
  • Monash University (UM99, 50 days, 98): 525,378 visits on 6727 pages
4
Web Access Prediction
  • Method: association-rule-based models

5
Build Prediction Models
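Association-rule-based predictive model
Example sessions: (A,B,C,D), (A,B,C,F), (A,B,E), (B,C), (B,C,D,G), (C,D)
Steps: 1. Extract rules; 2. Select rules
[Figure: current observations (A, B, C) followed by a window of prediction of size m, whose contents (?) are to be predicted]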
6
Moving Window Algorithm
  • Current window: W = W1 W2 (W1 followed by W2)
  • W1: observation window
  • W2: prediction window (size m)
  • W slides from the beginning of the session to the end

(A, B, C, A, C, D, G) ?
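The sliding of W over a session can be sketched as follows. This is only an illustration: the function name `moving_windows` and the fixed observation-window size `w1_size` are assumptions, since the slide only fixes the size m of W2.

```python
def moving_windows(session, w1_size, m):
    """Yield (W1, W2) pairs as the window W slides over one session.

    W1 holds w1_size consecutive observations; W2 holds the m requests
    that follow them and that a model would try to predict.
    """
    for start in range(len(session) - w1_size - m + 1):
        w1 = tuple(session[start:start + w1_size])
        w2 = tuple(session[start + w1_size:start + w1_size + m])
        yield w1, w2

session = ["A", "B", "C", "A", "C", "D", "G"]
windows = list(moving_windows(session, w1_size=3, m=1))
for w1, w2 in windows:
    print(w1, "->", w2)
```

Sessions shorter than `w1_size + m` simply produce no windows.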
7
Association Rules
  • LHS → RHS
  • RHS is restricted to one URL, but can be relaxed to more than one URL
  • RHS: the most popular URL with the same LHS
  • For each W1, if a rule applies and its RHS is in W2, then Success!
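A rough sketch of rule extraction and the success test described above. Several choices here are assumptions made for brevity: the LHS is taken to be the URLs immediately preceding the predicted request, support is counted per occurrence rather than per session, and `extract_rules`, `is_success`, `max_lhs_len` and `min_support` are hypothetical names.

```python
from collections import Counter, defaultdict

def extract_rules(sessions, max_lhs_len=2, min_support=2):
    """Map each LHS (tuple of adjacent URLs) to (RHS, support, confidence)."""
    followers = defaultdict(Counter)      # LHS -> counts of the URL that follows it
    for session in sessions:
        for i in range(1, len(session)):
            for n in range(1, min(max_lhs_len, i) + 1):
                followers[tuple(session[i - n:i])][session[i]] += 1
    rules = {}
    for lhs, counts in followers.items():
        rhs, support = counts.most_common(1)[0]   # most popular RHS for this LHS
        if support >= min_support:
            rules[lhs] = (rhs, support, support / sum(counts.values()))
    return rules

def is_success(rules, w1, w2):
    """True if the longest applicable rule predicts a URL that occurs in W2."""
    for n in range(len(w1), 0, -1):
        rule = rules.get(tuple(w1[-n:]))
        if rule is not None:
            return rule[0] in w2
    return False

sessions = [list("ABCD"), list("ABCF"), list("ABE"),
            list("BC"), list("BCDG"), list("CD")]
rules = extract_rules(sessions)
```

With these six example sessions, `rules[("B",)]` is `("C", 4, 0.8)`: B is followed by C four times out of five.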

8
Rule-Representation Methods (min sup = 2)
  • Subset: A, C → C
  • Substring: BC → C
  • Latest substring: C → C
  • Subsequence
  • Latest subsequence

9
Rule Representation
  • Subset rules: the LHS is a subset of the items appearing in W1, with no order imposed; corresponds to traditional association rules
  • Substring rules: the LHS is a substring of W1; items must be adjacent, but the LHS can start anywhere in W1
  • Latest-substring rules: substring rules where the LHS must end at the end of W1; also known as n-gram rules (n is a variable < |W1|)
  • C4.5 decision trees
  • Default rule: the most popular item in the web log

[Figure: tree with a root and nodes A, B, C, shown against windows W1 and W2 (?)]
10
Information Embedded In Rules
  • Subset method: appearance, in any order
  • Subsequence method: appearance and order information
  • Latest-subsequence method: appearance, order and recency information
  • Substring method: appearance, order and adjacency information
  • Latest-substring method: appearance, order, adjacency and recency information
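The five representations differ only in how an LHS is matched against W1, which the following sketch tries to make concrete. The function `matches` is hypothetical, and the reading of "latest subsequence" (a subsequence whose last item is the last item of W1) is an assumption, since the slides do not define it.

```python
def is_subsequence(lhs, w1):
    """True if the items of lhs occur in w1 in the same order, gaps allowed."""
    remaining = iter(w1)
    return all(item in remaining for item in lhs)

def is_substring(lhs, w1):
    """True if lhs occurs in w1 as a run of adjacent items."""
    n = len(lhs)
    return any(w1[i:i + n] == lhs for i in range(len(w1) - n + 1))

def matches(lhs, w1, method):
    """Check whether a rule's LHS applies to W1 under one representation."""
    lhs, w1 = tuple(lhs), tuple(w1)
    if method == "subset":                 # appearance only
        return set(lhs) <= set(w1)
    if method == "subsequence":            # appearance + order
        return is_subsequence(lhs, w1)
    if method == "latest-subsequence":     # + recency (assumed definition)
        return bool(lhs) and bool(w1) and lhs[-1] == w1[-1] and is_subsequence(lhs, w1)
    if method == "substring":              # appearance + order + adjacency
        return is_substring(lhs, w1)
    if method == "latest-substring":       # + recency: LHS is a suffix of W1
        return len(lhs) <= len(w1) and w1[len(w1) - len(lhs):] == lhs
    raise ValueError(f"unknown method: {method}")

w1 = ("A", "B", "C")
```

For example, with `w1 = ("A", "B", "C")` the LHS `("A", "C")` matches as a subsequence but not as a substring, and `("B", "C")` matches as a latest substring while `("A", "B")` does not.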

11
Rule-Selection Criteria
  • Among the rules whose LHS matches W1:
  • Longest-match selection: select the rule whose left-hand side is the longest; corresponds to using the strongest signature to predict
  • Most-confident selection: select the rule with the highest confidence
  • Pessimistic selection: U_CF(E, N) is the upper bound on the estimated error for a given confidence value, assuming a normal distribution of error
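The three criteria can be compared on a small set of candidate rules. The sketch below is an assumption-laden illustration: the `Rule` tuple and the candidate counts are invented for the example, `pessimistic_error` uses one common normal-approximation upper bound rather than a formula taken from the slides, and picking the rule with the lowest bound is an assumed reading of "pessimistic selection".

```python
from collections import namedtuple
from math import sqrt
from statistics import NormalDist

# hits = times the LHS was followed by the RHS; total = times the LHS was seen
Rule = namedtuple("Rule", "lhs rhs hits total")

def confidence(rule):
    return rule.hits / rule.total

def pessimistic_error(errors, n, cf=0.25):
    """Upper confidence bound on the error rate (normal approximation, assumed form)."""
    z = NormalDist().inv_cdf(1 - cf)
    f = errors / n
    centre = f + z * z / (2 * n)
    spread = z * sqrt(f * (1 - f) / n + z * z / (4 * n * n))
    return (centre + spread) / (1 + z * z / n)

def longest_match(candidates):
    return max(candidates, key=lambda r: len(r.lhs))

def most_confident(candidates):
    return max(candidates, key=confidence)

def pessimistic(candidates):
    return min(candidates, key=lambda r: pessimistic_error(r.total - r.hits, r.total))

# hypothetical rules whose LHS all match the same W1
candidates = [
    Rule(("A", "B", "C"), "D", hits=1, total=2),
    Rule(("B", "C"), "E", hits=90, total=100),
    Rule(("C",), "F", hits=1, total=1),
]
```

Here the three criteria disagree: longest match picks the rule predicting D, most confident picks F (confidence 1.0 from a single observation), and the pessimistic bound prefers E because its confidence rests on many more observations.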

12
Comparison Matrix
  • Comparison criteria: precision, model size

13
Longest Match (NASA)
  • Number of rules controlled by min support
  • Latest-substring: a clear winner

14
Most-confident (NASA)
  • Again, latest-substring is a winner
  • Drop off after 10,000 due to overfitting

15
Greedy-Dual-Size Frequency
  • Cache replacement algorithm
  • A key value K(p) is assigned to each cached object p
  • Arlitt et al., USENIX 1998; Cao & Irani, 97
  • K(p) = L + F(p) × C(p) / S(p)
  • C(p): cost of loading a page (e.g., amount of time)
  • S(p): size of a page
  • F(p): frequency count of a page
  • L: an inflation factor to reflect cache aging
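A compact sketch of a cache that evicts by the key value K(p) = L + F(p) × C(p) / S(p). The class name `GDSFCache`, the linear scan for the victim, and the default cost of 1.0 are assumptions for illustration; setting L to the key of the evicted object follows the usual description of Greedy-Dual-Size algorithms.

```python
class GDSFCache:
    """Toy Greedy-Dual-Size Frequency cache keyed by URL."""

    def __init__(self, capacity):
        self.capacity = capacity      # total bytes available
        self.used = 0
        self.inflation = 0.0          # L, raised on every eviction
        self.entries = {}             # url -> {"size", "cost", "freq", "key"}

    def _key(self, entry):
        return self.inflation + entry["freq"] * entry["cost"] / entry["size"]

    def access(self, url, size, cost=1.0):
        """Record a request; return True on a hit and False on a miss."""
        entry = self.entries.get(url)
        if entry is not None:
            entry["freq"] += 1
            entry["key"] = self._key(entry)
            return True
        if size > self.capacity:      # object can never fit
            return False
        while self.used + size > self.capacity:
            victim = min(self.entries, key=lambda u: self.entries[u]["key"])
            self.inflation = self.entries[victim]["key"]   # cache ages
            self.used -= self.entries[victim]["size"]
            del self.entries[victim]
        entry = {"size": size, "cost": cost, "freq": 1}
        entry["key"] = self._key(entry)
        self.entries[url] = entry
        self.used += size
        return False

cache = GDSFCache(capacity=100)
results = [cache.access("a", 50), cache.access("a", 50),
           cache.access("b", 50), cache.access("c", 50)]
```

In this run the second request for "a" is a hit, and inserting "c" evicts "b", the object with the smallest key. A production cache would keep the keys in a priority queue instead of scanning all entries.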

16
Predicting future frequency using latest-substring + longest match
  • Session 1: O1 0.70, O2 0.90, O3 0.30, O4 0.11
  • Session 2: O1 0.60, O2 0.70, O3 0.20, O5 0.42
  • Session 3: O1 0.70, O2 0.90, O4 0.30, O5 0.33
Predicted Frequency
  • W1 = 0.70 + 0.60 + 0.70 = 2.00
  • W2 = 0.90 + 0.70 + 0.90 = 2.50
  • W3 = 0.30 + 0.20 = 0.50
  • W4 = 0.11 + 0.30 = 0.41
  • W5 = 0.42 + 0.33 = 0.75
  • Ki = L + (Wi + Fi) × Ci / Si
  • Wi: future frequency; Fi: past frequency
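The arithmetic on this slide can be reproduced in a few lines. The per-session values are the ones shown above; the function names and the inputs passed to `key_value` are hypothetical.

```python
from collections import defaultdict

# predicted values per object in each active session (from the slide)
session_predictions = [
    {"O1": 0.70, "O2": 0.90, "O3": 0.30, "O4": 0.11},
    {"O1": 0.60, "O2": 0.70, "O3": 0.20, "O5": 0.42},
    {"O1": 0.70, "O2": 0.90, "O4": 0.30, "O5": 0.33},
]

def predicted_frequency(predictions):
    """Sum the per-session predictions of each object to get its W value."""
    totals = defaultdict(float)
    for per_session in predictions:
        for obj, value in per_session.items():
            totals[obj] += value
    return dict(totals)

def key_value(inflation, future, past, cost, size):
    """K = L + (W + F) * C / S, the frequency term extended by the prediction."""
    return inflation + (future + past) * cost / size

W = predicted_frequency(session_predictions)
for obj in sorted(W):
    print(obj, round(W[obj], 2))

# hypothetical object: L = 0, W = 2.0, F = 3 past requests, cost 1, size 10
example_key = key_value(0.0, 2.0, 3, 1.0, 10)
```

Objects expected to be requested soon get a larger key and therefore stay in the cache longer than their past frequency alone would justify.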

17
Hit Rate measures latency reduction
18
Rule Pruning: not all rules are useful!
  • Suppose that we have two rules for the test case <B, C> → ?
  • Rule 1: <A, B, C> → D (confidence 50%)
  • Rule 2: <B, C> → E (confidence 70%)
  • In general, rules form a hierarchy that we call the Latest-Substring Index Tree (LSIT)
  • Each rule is represented by a node in the LSIT
  • The root of the LSIT represents the default rule
  • The node representing the direct parent rule is the parent of the node(s) representing its direct child rule(s)
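One way to picture the LSIT is sketched below. The slides do not define "direct parent rule", so the sketch assumes that a rule's parent is the rule whose LHS is its longest proper suffix, with the default rule (empty LHS) at the root; `LSITNode`, `build_lsit` and `lookup` are hypothetical names, and no pruning is shown.

```python
class LSITNode:
    def __init__(self, lhs, rhs, confidence=None):
        self.lhs = tuple(lhs)
        self.rhs = rhs
        self.confidence = confidence
        self.children = []

def build_lsit(rules, default_rhs):
    """rules maps an LHS tuple to (rhs, confidence); returns (root, nodes by LHS)."""
    root = LSITNode((), default_rhs)
    nodes = {(): root}
    for lhs, (rhs, conf) in rules.items():
        nodes[tuple(lhs)] = LSITNode(lhs, rhs, conf)
    for lhs, node in nodes.items():
        if not lhs:
            continue
        suffix = lhs[1:]
        while suffix not in nodes:        # ends at the root, whose LHS is ()
            suffix = suffix[1:]
        nodes[suffix].children.append(node)
    return root, nodes

def lookup(nodes, w1):
    """Return the node with the longest LHS that is a suffix of W1."""
    w1 = tuple(w1)
    for n in range(len(w1), -1, -1):
        node = nodes.get(w1[len(w1) - n:])
        if node is not None:
            return node

rules = {("A", "B", "C"): ("D", 0.5), ("B", "C"): ("E", 0.7)}
root, nodes = build_lsit(rules, default_rhs="most popular page")
```

With the two rules from this slide, the node for <B, C> hangs under the root and the node for <A, B, C> hangs under <B, C>; a test case <B, C> reaches Rule 2, and an unseen W1 falls back to the default rule at the root.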

19
LSIT Example
20
LSIT Pruning
21
LSIT Evaluation
Experiments are based on NASA data: 1,569,898 visits on 15,429 pages
22
Conclusions
  • Web-data mining requires extensive data cleaning
  • Data cleaning involves not only cleaning the raw
    data, but also the mined knowledge
  • In our case, the rule set is also cleaned to
    yield better results