Weblog Mining: from Pages to Relations
(PowerPoint presentation transcript)
1
Web-log Mining: from Pages to Relations
  • Qiang Yang
  • HKUST, Hong Kong, China

2
A Spectrum of Web-Log Miners
Knowledge
  • Knowledge Rich
  • Have logical relations over page contents
  • Database-generated pages
  • Can make accurate predictions even without data!
  • Knowledge Middle
  • Some hierarchical representations of ontology and
    content
  • Most cases!
  • Can predict based on similarity
  • Knowledge Poor
  • Have only page-level logs
  • No relational knowledge
  • Can predict observed pages only when data is
    plenty

Interesting!
3
1. Knowledge-Poor: Web-log Mining
Association-rule-based predictive model
  • Example sessions: {A,B,C,D}, {A,B,C,F}, {A,B,E}, {B,C}, {B,C,D,G}, {C,D}
  • Step 1: extract rules from the sessions
  • Step 2: select rules to apply
[Figure: an observation window of size m over the current observations
(e.g., ... A, B, C) feeds the model, which predicts pages in a prediction
window]
4
Association Rules
  • LHS → RHS
  • RHS restricted to one URL, but can be relaxed to more than one URL
  • RHS: the most popular URL with the same LHS
  • For each observation window W1, if a rule applies and its RHS is in the
    prediction window W2, then Success!
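The window-based evaluation above can be sketched as follows (a minimal sketch, not the authors' code; rule and window values are illustrative):

```python
# A rule LHS -> RHS "applies" when its LHS is contained in the current
# observation window W1; the prediction succeeds when the RHS actually
# appears in the prediction window W2.

def predict(rules, w1):
    """Return the RHS of the first rule whose LHS matches W1 (as a subset)."""
    for lhs, rhs in rules:
        if set(lhs) <= set(w1):
            return rhs
    return None

rules = [(("A", "B"), "C"), (("B",), "D")]
w1 = ["A", "B"]          # current observations
w2 = ["C", "E"]          # pages actually requested next
rhs = predict(rules, w1)
success = rhs is not None and rhs in w2
print(rhs, success)      # -> C True
```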

5
Rule-Representation Methods (min sup = 2)
  • Subset: A, C → C
  • Substring: B, C → C
  • Latest Substring: C → C
  • Subsequence
  • Latest Subsequence
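Two of these representations can be sketched as LHS generators over an observation window (a hypothetical illustration; function names are mine, not the paper's):

```python
def substrings(window):
    """All contiguous pieces of the window (Substring representation)."""
    return [tuple(window[i:j]) for i in range(len(window))
            for j in range(i + 1, len(window) + 1)]

def latest_substrings(window):
    """Only the pieces ending at the latest page (Latest Substring)."""
    return [tuple(window[i:]) for i in range(len(window))]

print(substrings(["B", "C"]))         # [('B',), ('B', 'C'), ('C',)]
print(latest_substrings(["B", "C"]))  # [('B', 'C'), ('C',)]
```

The Latest Substring family is much smaller, which is why it yields compact rule sets.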

6
Rule-Selection Criteria
  • Among the rules whose LHS matches W1:
  • Longest-Match Selection
  • Select the rule with the longest left-hand side that applies
  • Corresponds to using the strongest signature to predict
  • Most Confident Selection
  • Select the rule with the highest confidence
  • Pessimistic Selection
  • UCF(E, N) is the upper bound on the estimated error for a given
    confidence value, assuming a normal distribution of errors
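The first two criteria can be sketched directly (a sketch, not the paper's code; rules are `(lhs, rhs, confidence)` triples already known to match W1; pessimistic selection would additionally penalize confidence by the UCF error bound):

```python
def longest_match(rules):
    """Pick the rule with the longest LHS (the strongest signature)."""
    return max(rules, key=lambda r: len(r[0]))

def most_confident(rules):
    """Pick the rule with the highest confidence."""
    return max(rules, key=lambda r: r[2])

rules = [(("C",), "D", 0.9), (("B", "C"), "E", 0.6)]
print(longest_match(rules)[1])   # E  (longer LHS wins)
print(most_confident(rules)[1])  # D  (higher confidence wins)
```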

7
Comparison Matrix
  • Compared with C4.5 as well
  • Comparison criteria: precision and model size

8
Integrating with Caching
  • Cache replacement algorithm
  • A key value K(p) is assigned to each object p
  • Arlitt et al., USENIX 1998; Cao & Irani, 1997
  • K(p) = L + F(p) × C(p) / S(p)
  • C(p): cost of loading a page (e.g., amount of time)
  • S(p): size of a page
  • F(p): frequency count of a page
  • L: an inflation factor to reflect cache aging
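A minimal sketch of this GDSF-style key and its eviction step, assuming the K(p) = L + F(p)·C(p)/S(p) formula above (names and values are illustrative):

```python
def key(L, freq, cost, size):
    """K(p) = L + F(p) * C(p) / S(p)."""
    return L + freq * cost / size

def evict(cache, L):
    """Evict the object with the smallest key; L inflates to that key,
    so long-resident objects age relative to freshly admitted ones."""
    victim = min(cache, key=cache.get)
    L = cache.pop(victim)
    return victim, L

cache = {"a": key(0, 1, 10, 5), "b": key(0, 3, 10, 5)}  # a: 2.0, b: 6.0
victim, L = evict(cache, 0)
print(victim, L)   # a 2.0
```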

9
Predicting Future Frequency
  • Session 1: O1 0.70, O2 0.90, O3 0.30, O4 0.11
  • Session 2: O1 0.60, O2 0.70, O3 0.20, O5 0.42
  • Session 3: O1 0.70, O2 0.90, O4 0.30, O5 0.33
  • Predicted frequencies:
  • W1 = 0.70 + 0.60 + 0.70 = 2.00
  • W2 = 0.90 + 0.70 + 0.90 = 2.50
  • W3 = 0.30 + 0.20 = 0.50
  • W4 = 0.11 + 0.30 = 0.41
  • W5 = 0.42 + 0.33 = 0.75
  • Ki = L + (Wi + Fi) × Ci / Si
  • Wi: future (predicted) frequency
  • Fi: past frequency
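The aggregation in this table can be sketched as summing each object's predicted per-session probability (a sketch; the session data mirrors the example above):

```python
from collections import defaultdict

def predicted_frequency(sessions):
    """Wi = sum over sessions of the object's predicted probability."""
    w = defaultdict(float)
    for session in sessions:
        for obj, prob in session:
            w[obj] += prob
    return dict(w)

sessions = [
    [("O1", 0.70), ("O2", 0.90), ("O3", 0.30), ("O4", 0.11)],
    [("O1", 0.60), ("O2", 0.70), ("O3", 0.20), ("O5", 0.42)],
    [("O1", 0.70), ("O2", 0.90), ("O4", 0.30), ("O5", 0.33)],
]
w = predicted_frequency(sessions)
print(round(w["O1"], 2), round(w["O2"], 2))   # 2.0 2.5
```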

10
Byte-Hit Rate measures bandwidth reduction
  • Byte Hit Rate = (bytes answered by cache) / (total bytes)
11
[Figures: relative network traffic (NASA), prefetch vs. no-prefetch;
fractional latency (NASA)]
12
Knowledge Rich: the other extreme
  • Web log mining
  • User sessions
  • Markov models
  • But sometimes data about specific pages are sparse!
  • Cannot train the Markov models properly
  • A single visitor views ≈0% of any site
  • New dynamic content is not in the training data
  • Now, many pages are generated automatically
  • Deep web
  • Dynamically generated pages
  • Question: if we have relational knowledge, what more can we do?

13
Relational Markov Models
  • RMM (Relational Markov Model)
  • Group pages of the same type into relations
  • Combine low-level and high-level information
  • Automatically adapt web sites for different users
  • What "relation" means:
  • Buys(Student, PC)
  • Student(ID, Name, Addr)
  • Relational algebra is the basis for relational databases

14
Relational Markov Models
[Anderson et al., KDD 2002]
  • Domains often contain relational structure
  • Each state is a tuple, in the relational-database sense
  • Structure enables state generalization
  • Which allows learning from sparse data

15
Relational Markov Models
16
Relational Markov Model
17
RMM Generalization
  • Want to estimate P(s → d), but there is no data!
  • Use shrinkage
  • Can do this with abstractions of d and s
  • Let σ be an abstraction of s and δ an abstraction of d
18
Learning and Inference
  • Make maximum use of the available information
  • The λ's are non-negative coefficients that sum to 1 over all
    abstractions of the source and destination
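The shrinkage estimate is a λ-weighted mixture of transition estimates at each abstraction level; a minimal sketch (the probabilities and weights below are made-up illustrations, not results from the paper):

```python
def shrinkage(level_estimates, lambdas):
    """P(s -> d) ≈ sum_i lambda_i * P(abstraction_i(s) -> abstraction_i(d)).
    The lambdas are non-negative and sum to 1."""
    assert all(l >= 0 for l in lambdas)
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(l * p for l, p in zip(lambdas, level_estimates))

# Concrete level has no data (estimate 0); two abstract levels have data.
estimate = shrinkage([0.0, 0.25, 0.40], [0.2, 0.5, 0.3])
print(round(estimate, 3))   # 0.245
```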

19
How to Get the λ's
  • Intuitively, a λ should be large when:
  • Its abstractions are more specific
  • Training data is abundant
  • RMM-uniform
  • Easy and fast, but poor results
  • RMM-EM
  • Poor results
  • RMM-rank
  • Slower than a plain Markov model, but generally fast
  • Requires I(qs), I(qd); when the hierarchy is large, computation is slow

20
Adaptive Web Navigation
  • Anderson's algorithm
  • Personalize web sites based on a person's browsing pattern
  • (add a link, rearrange list items, etc.)
  • Step 1: mine the web server log to build models (RMMs) of users
  • Step 2: adapt the site for the user
  • Features:
  • No training data on some pages
  • Periodically changing pages
  • Solution to sparse training data:
  • Identify semantic correspondences between pages, both visited and unseen

21
Evaluation (Anderson, Weld, Domingos 2002)
Gazelle
  • www.gazelle.com --- an e-commerce site
  • www.cs.washington.edu/education/courses/
  • Generates good relational structure
  • Computation time: RMM slower than the plain Markov model

22
III. Knowledge Middle: most cases
  • No relational knowledge
  • But there is ontological knowledge
  • Still have the sparse-data problem
  • However
  • Most web sites have some ontological structure
  • We can build Markov Models based on these
    structures

23
Vector Representation of Web Pages
  • A feature vector vi is defined by a set of features: vi = (f1, f2, ..., fl)

where P is the whole page space and V is the whole feature-vector space;
e.g., v1 = <path, keyword set, out-link vector>
24
The Similarity Function
  • Sim(vi, vj) = Σk wk · S(fik, fjk)
  • where wk is the weight of the k-th feature, and S(fik, fjk) is the
    similarity of the two features in position k
  • Features:
  • a) different paths
  • b) different keywords
  • c) different out-links
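A sketch of this weighted feature-by-feature similarity (the per-feature similarity functions, weights, and page vectors below are illustrative assumptions, not the paper's exact choices):

```python
def jaccard(a, b):
    """Set overlap for keyword sets and out-link sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity(vi, vj, weights, sims):
    """Sim(vi, vj) = sum_k w_k * S_k(f_ik, f_jk)."""
    return sum(w * s(fi, fj)
               for w, s, fi, fj in zip(weights, sims, vi, vj))

same_path = lambda a, b: 1.0 if a == b else 0.0   # crude path similarity
v1 = ("/comp102/notes", {"loop", "array"}, {"p5", "p7"})
v2 = ("/comp102/notes", {"loop", "file"}, {"p5"})
score = similarity(v1, v2, [0.6, 0.3, 0.1], [same_path, jaccard, jaccard])
print(round(score, 2))   # 0.75
```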

25
Preprocessing → vector space
26
Markov Model-Based Clustering Algorithm
  • Model-based clustering algorithm
  • Step 1: cluster the feature-vector sequences into K clusters using
    mixture Markov models (EM)
  • Step 2: for each new sequence s, calculate the probabilities that it
    belongs to each of the K clusters
  • Difference from [Cadez et al. 2000]:
  • We cluster the feature-vector sequences instead of pages
  • Goal: classification with unseen data
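Step 2 can be sketched as scoring a new sequence under each cluster's Markov model (prior and transition probabilities) and normalizing; the toy models below are made up for illustration:

```python
def cluster_posteriors(seq, priors, trans_models):
    """P(k | seq) ∝ prior_k * product over t of P_k(seq[t+1] | seq[t])."""
    scores = []
    for prior, trans in zip(priors, trans_models):
        p = prior
        for a, b in zip(seq, seq[1:]):
            p *= trans.get((a, b), 1e-6)   # smooth unseen transitions
        scores.append(p)
    z = sum(scores)
    return [s / z for s in scores]

trans_models = [{("A", "B"): 0.9, ("B", "C"): 0.8},
                {("A", "B"): 0.1, ("B", "C"): 0.2}]
post = cluster_posteriors(["A", "B", "C"], [0.5, 0.5], trans_models)
print(round(post[0], 3))   # 0.973
```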

27
Classification on Test Data
  • Measurements on Prediction Performance
  • Accuracy
  • Recall

28
Classification on Test Data
  • Suppose the vector sequence is v = v1 v2 ... vp, where vi may not occur
    in the training data
  • From the similarity matrix M, we get a list of the top-K vectors most
    similar to vi
  • Use Vij to denote the top-K similar vectors to vi

29
Example of Similarity
  • Given a new sequence v = <4 4 5 2 5 9 4>,
  • the feature vector 9 is new.
  • The top three similar feature vectors are 4, 7, 1.
  • Hence, we have three candidate sequences:
  • <4 4 5 2 5 4 4>
  • <4 4 5 2 5 7 4>
  • <4 4 5 2 5 1 4>
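The candidate-sequence construction above can be sketched as replacing every unseen vector by each of its top-K similar training vectors (a sketch; function names are mine):

```python
def candidate_sequences(seq, training_vocab, topk):
    """Expand seq into all sequences where each unseen vector is replaced
    by one of its top-K similar training vectors."""
    out = [[]]
    for v in seq:
        choices = [v] if v in training_vocab else topk[v]
        out = [c + [r] for c in out for r in choices]
    return out

vocab = {1, 2, 4, 5, 7}
topk = {9: [4, 7, 1]}   # top-3 vectors similar to the unseen vector 9
cands = candidate_sequences([4, 4, 5, 2, 5, 9, 4], vocab, topk)
print(len(cands), cands[0])   # 3 [4, 4, 5, 2, 5, 4, 4]
```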

30
Test Data: COMP102 at HKUST
  • COMP102 (http://www.cs.ust.hk/liao/comp102/),
  • collected from 31 August 2002 to 17 December 2002; a C course for
    first-year undergraduates
  • 281 different web pages, created at different times
  • 6,089 individual IP addresses visited
  • 255,074 valid requests
  • 60 megabytes in flat-text format
  • 9,913 sessions altogether
  • Average session length is 9.3

31
COMP102 data (clicks)
32
COMP102, Browsing models
33
Test accuracy with/without similarity
  • Training: first 42 days; Testing: next 18 days
  • w1 (category) >> w2 (keywords), w3 (link)

34
Test recall with/without similarity
  • Training: first 42 days; Testing: next 18 days
  • w1 (category) >> w2 (keywords), w3 (link)

35
Test accuracy with/without similarity
  • Training: every 10 days; Testing: next 10 days
  • w1 (category) > w2 (keywords) > w3 (link)

36
Test recall with/without similarity
  • Training: every 10 days; Testing: next 10 days
  • w1 (category) > w2 (keywords) > w3 (link)

[Figure: recall across training periods, from the first 10 days to the
last 10 days]
37
Conclusions and Future Work
  • Knowledge Poor
  • Page-level Markov models
  • Knowledge Middle
  • Ontological Markov models
  • Knowledge Rich
  • Relational Markov models
  • Future work: uncover implicit knowledge