Title: Weblog Mining: from Pages to Relations
1. Web-Log Mining: from Pages to Relations
- Qiang Yang
- HKUST,
- Hong Kong, China
2. A Spectrum of Web-Log Miners
- Knowledge Rich
- Have logical relations over page contents
- Database generated pages
- Can make accurate predictions even without data!
- Knowledge Middle
- Some hierarchical representations of ontology and content
- Most cases!
- Can predict based on similarity
- Knowledge Poor
- Have only page-level logs
- No relational knowledge
- Can predict observed pages only when data is plentiful
3. I. Knowledge-Poor: Web-Log Mining
Association Rule based Predictive Model
[Figure: example click sessions A,B,C,D / A,B,C,F / A,B,E / B,C / B,C,D,G / C,D. A window of current observations (size m) feeds two steps, (1) extract rules and (2) select rules, to predict pages in the window of prediction.]
4. Association Rules
- LHS → RHS
- RHS restricted to one URL, but can be relaxed to more than one URL
- RHS: the most popular URL with the same LHS
- For each W1, if a rule applies and its RHS is in W2, then Success! (see the sketch below)
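A minimal sketch of this predictive model, using the example sessions from the figure above; the window size, helper names, and session format are illustrative assumptions, not from the slides:

```python
# Sketch: extract LHS -> RHS association rules from sessions and predict.
from collections import Counter

def extract_rules(sessions, w1_size=2, min_support=2):
    """Count (LHS, RHS) pairs: LHS is a window of w1_size consecutive
    pages; RHS is any single page seen later in the same session."""
    counts = Counter()
    for s in sessions:
        for i in range(len(s) - w1_size):
            lhs = tuple(s[i:i + w1_size])
            for rhs in s[i + w1_size:]:
                counts[(lhs, rhs)] += 1
    # RHS restricted to one URL: keep the most popular RHS per LHS
    best = {}
    for (lhs, rhs), n in counts.items():
        if n >= min_support and n > best.get(lhs, (None, 0))[1]:
            best[lhs] = (rhs, n)
    return {lhs: rhs for lhs, (rhs, _) in best.items()}

def predict(rules, w1):
    """If a rule's LHS matches the current observation window, predict."""
    return rules.get(tuple(w1))

sessions = [list("ABCD"), list("ABCF"), list("ABE"),
            list("BC"), list("BCDG"), list("CD")]
rules = extract_rules(sessions)
print(predict(rules, ["A", "B"]))  # -> 'C' (support 2)
```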
5. Rule-Representation Methods (min sup = 2)
- Subset: A, C → C
- Substring: BC → C
- Latest Substring: C → C
- Subsequence
- Latest Subsequence (all five LHS forms are sketched below)
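A hedged sketch of how the five LHS representations could be enumerated from an observation window; the function names are illustrative and assume distinct page ids in the window:

```python
# Sketch: the five LHS forms for an observation window W1, e.g. ['B','C'].
from itertools import combinations

def subsets(w):            # Subset: any subset, order ignored
    return [c for r in range(1, len(w) + 1) for c in combinations(sorted(w), r)]

def substrings(w):         # Substring: contiguous runs
    return [tuple(w[i:j]) for i in range(len(w)) for j in range(i + 1, len(w) + 1)]

def latest_substrings(w):  # Latest Substring: runs ending at the latest click
    return [tuple(w[i:]) for i in range(len(w))]

def subsequences(w):       # Subsequence: order-preserving, gaps allowed
    return [c for r in range(1, len(w) + 1) for c in combinations(w, r)]

def latest_subsequences(w):  # Latest Subsequence: must include the latest click
    return [c for c in subsequences(w) if c[-1] == w[-1]]
```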
6. Rule-Selection Criteria
- Among the rules whose LHS matches W1:
- Longest-Match Selection
- Select the rule whose left-hand side is the longest
- Corresponds to using the strongest signature to predict
- Most-Confident Selection
- Select the rule with the highest confidence
- Pessimistic Selection (see the sketch below)
- UCF(E, N) is the upper bound on the estimated error for a given confidence value, assuming a normal distribution of errors
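A hedged sketch of pessimistic selection. UCF(E, N) is implemented here with the C4.5-style normal approximation to the binomial upper bound on the error rate; the default confidence factor 0.25 mirrors C4.5 and is an assumption, not from the slides:

```python
# Sketch: pessimistic rule selection via an upper confidence bound on error.
from collections import namedtuple
from statistics import NormalDist

Rule = namedtuple("Rule", "lhs rhs errors n")  # illustrative rule record

def ucf(e, n, cf=0.25):
    """Upper bound on the error rate given e errors in n applications,
    using the normal approximation (cf is the confidence level)."""
    z = NormalDist().inv_cdf(1 - cf)
    f = e / n
    return ((f + z * z / (2 * n)
             + z * (f / n - f * f / n + z * z / (4 * n * n)) ** 0.5)
            / (1 + z * z / n))

def pessimistic_select(matching_rules):
    """Among rules whose LHS matches W1, pick the lowest pessimistic error."""
    return min(matching_rules, key=lambda r: ucf(r.errors, r.n))
```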
7. Comparison Matrix
- Compared with C4.5 as well
- Comparison criteria: precision, model size
8. Integrating with Caching
- Cache replacement algorithm (sketched below)
- A key value K(p) is assigned to each object p
- Arlitt et al., USENIX 1998; Cao & Irani, '97
- K(p) = L + F(p) · C(p) / S(p)
- C(p): cost of loading a page (e.g., amount of time)
- S(p): size of a page
- F(p): frequency count of a page
- L: an inflation factor to reflect cache aging
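A minimal sketch of a GDSF-style cache using this key; the class shape and eviction loop are illustrative assumptions:

```python
# Sketch: GDSF-style replacement with K(p) = L + F(p) * C(p) / S(p).
class GDSFCache:
    def __init__(self, capacity_bytes):
        self.capacity, self.used, self.L = capacity_bytes, 0, 0.0
        self.entries = {}                 # url -> (freq, cost, size)

    def _key(self, url):
        f, c, s = self.entries[url]
        return self.L + f * c / s

    def access(self, url, cost, size):
        if url in self.entries:           # hit: bump frequency
            f, _, _ = self.entries[url]
            self.entries[url] = (f + 1, cost, size)
            return True
        while self.used + size > self.capacity and self.entries:
            victim = min(self.entries, key=self._key)
            self.L = self._key(victim)    # aging: inflate L to evicted key
            self.used -= self.entries.pop(victim)[2]
        self.entries[url] = (1, cost, size)
        self.used += size
        return False                      # miss
```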
9. Predicting Future Frequency
Per-session predicted probabilities:
- Session 1: O1 = 0.70, O2 = 0.90, O3 = 0.30, O4 = 0.11
- Session 2: O1 = 0.60, O2 = 0.70, O3 = 0.20, O5 = 0.42
- Session 3: O1 = 0.70, O2 = 0.90, O4 = 0.30, O5 = 0.33
Predicted (future) frequency:
- W1 = 0.70 + 0.60 + 0.70 = 2.00
- W2 = 0.90 + 0.70 + 0.90 = 2.50
- W3 = 0.30 + 0.20 = 0.50
- W4 = 0.11 + 0.30 = 0.41
- W5 = 0.42 + 0.33 = 0.75
- Ki = L + (Wi + Fi) · Ci / Si (see the sketch below)
- Wi: future frequency
- Fi: past frequency
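A sketch reproducing the worked example: the future frequency W_i sums the per-session predicted probabilities, then feeds the extended key. Assuming the key combines W_i and F_i additively (the operator is garbled in the source):

```python
# Sketch: future frequency W_i and the extended key K_i.
from collections import defaultdict

def future_frequency(session_predictions):
    """session_predictions: one {object: predicted probability} per session."""
    w = defaultdict(float)
    for preds in session_predictions:
        for obj, p in preds.items():
            w[obj] += p
    return dict(w)

sessions = [{"O1": 0.70, "O2": 0.90, "O3": 0.30, "O4": 0.11},
            {"O1": 0.60, "O2": 0.70, "O3": 0.20, "O5": 0.42},
            {"O1": 0.70, "O2": 0.90, "O4": 0.30, "O5": 0.33}]
W = future_frequency(sessions)  # W1=2.00 W2=2.50 W3=0.50 W4=0.41 W5=0.75

def key(L, w_i, f_i, c_i, s_i):
    return L + (w_i + f_i) * c_i / s_i   # assumed '+' between W_i and F_i
```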
10. Byte-Hit Rate (measures bandwidth reduction)
- Byte hit rate = (bytes answered by the cache) / (total bytes)
11. Relative Network Traffic (NASA)
[Figures: relative network traffic, prefetch vs. no-prefetch, and fractional latency on the NASA trace]
12. II. Knowledge-Rich: The Other Extreme
- Web-log mining
- User sessions
- Markov models
- But sometimes data about specific pages are sparse!
- Cannot train the Markov models properly
- A single visitor views ~0% of any site
- New dynamic content is not in the training data
- Now, many pages are generated automatically
- Deep web
- Dynamically generated pages
- Question: if we have the relational knowledge, what more can we do?
13. Relational Markov Models
- RMM (Relational Markov Model)
- Group pages of the same type into relations
- Combine low-level and high-level information
- Automatically adapt web sites for different users
- What a relation means:
- Buys(Student, PC)
- Student(ID, Name, Addr)
- Relational algebra is the basis for relational databases
14. Relational Markov Models (Anderson et al., KDD '02)
- Domains often contain relational structure
- Each state is a tuple, in the relational-database sense
- Structure enables state generalization
- Which allows learning from sparse data
15. Relational Markov Models
16. Relational Markov Model
17. RMM Generalization
- Want to estimate P(s → d), but no data!
- Use shrinkage
- Can do this with abstractions of d and s
- Let σ be an abstraction of s and δ an abstraction of d
- P(s → d) = Σ_i λ_i · P_MLE(σ_i → δ_i), summed over abstraction pairs (σ_i, δ_i) of (s, d)
18. Learning and Inference
- Maximum use of available information
- P(s → d) = Σ_i λ_i · P_MLE(σ_i → δ_i) (see the sketch below)
- The λ_i are non-negative coefficients that sum to 1, taken over all abstractions of the source and destination
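A hedged sketch of the shrinkage estimate; the representation of abstraction pairs and counts is an illustrative assumption:

```python
# Sketch: shrinkage over abstraction pairs of (s, d).
def shrinkage_estimate(s, d, abstraction_pairs, counts, lambdas):
    """P(s -> d) = sum_i lambda_i * P_MLE(sigma_i -> delta_i).
    abstraction_pairs(s, d) yields (sigma_i, delta_i), most specific first;
    counts[(sigma, delta)] holds observed transition counts."""
    p = 0.0
    for lam, (sigma, delta) in zip(lambdas, abstraction_pairs(s, d)):
        n_sd = counts.get((sigma, delta), 0)
        n_s = sum(n for (a, _), n in counts.items() if a == sigma)
        if n_s > 0:
            p += lam * n_sd / n_s         # lambda_i * maximum-likelihood est.
    return p
```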
19. How to Get the λ's
- Intuitively, a λ should be large when:
- Its abstractions are more specific
- Training data is abundant
- RMM-uniform
- Easy and fast, but poor results
- RMM-EM
- Poor results
- RMM-rank
- Slower than the Markov model, but generally fast
- Requires I(q_s) · I(q_d); when the abstraction hierarchy is large, computation is slow
20. Adaptive Web Navigation
- Anderson's algorithm
- Personalize web sites based on a person's browsing pattern (add a link, rearrange list items, etc.)
- Step 1: mine the web-server log to build models (RMMs) of users
- Step 2: adapt the site for the user
- Features:
- No training data on some pages
- Periodically changing pages
- Solution to sparse training data:
- Identify semantic correspondence between pages, both visited and unseen
21. Evaluation (Anderson, Weld, Domingos '02)
- Gazelle: www.gazelle.com, an e-commerce site
- www.cs.washington.edu/education/courses/
- Generate good relational structure
- Computation time: RMM vs. Markov model
22. III. Knowledge-Middle: Most Cases
- No relational knowledge
- But there is ontological knowledge
- Still have the sparse-data problem
- However:
- Most web sites have some ontological structure
- We can build Markov models based on these structures
23. Vector Representation of Web Pages
- A feature vector v_i is defined by a set of features: v_i = (f_1, f_2, ..., f_l)
- where P is the whole page space and V is the whole feature-vector space
- e.g., v_1 = <paths, keyword set, <out-linking vector>>
24. The Similarity Function
- Sim(v_i, v_j) = Σ_k w_k · S(f_ik, f_jk) (see the sketch below)
- where w_k is the weight of the k-th feature, and S(f_ik, f_jk) is the similarity of two features in position k
- Features:
- a) different paths
- b) different keywords
- c) different out-links
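A minimal sketch of this weighted similarity, assuming Jaccard similarity as the per-feature measure S (the slides do not specify it):

```python
# Sketch: Sim(v_i, v_j) = sum_k w_k * S(f_ik, f_jk).
def feature_sim(a, b):
    """Jaccard similarity of two feature values treated as sets (assumed)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def page_sim(vi, vj, weights):
    """vi, vj: feature vectors, e.g. (path tokens, keyword set, out-links)."""
    return sum(w * feature_sim(fi, fj) for w, fi, fj in zip(weights, vi, vj))

v1 = (["cs", "comp102", "notes"], {"loop", "array"}, {"/comp102/hw1.html"})
v2 = (["cs", "comp102", "slides"], {"loop", "pointer"}, {"/comp102/hw1.html"})
print(page_sim(v1, v2, weights=(0.6, 0.3, 0.1)))
```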
25. Preprocessing → Vector Space
26. Markov-Model-Based Clustering Algorithm
- Model-based clustering algorithm (Step 2 is sketched below)
- Step 1: cluster the feature-vector sequences into K clusters by mixture Markov models (EM)
- Step 2: for each new sequence s, calculate the probabilities that it belongs to the K clusters
- Difference from Cadez '00:
- We cluster the feature-vector sequences instead of pages
- Goal: classification with unseen data
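A compressed sketch of Step 2 only: given K first-order Markov models already fit by EM, compute cluster-membership probabilities for a new sequence. The parameter layout is an assumption, and the EM fit itself is omitted:

```python
# Sketch: P(cluster k | sequence) under K fitted Markov models.
import math

def cluster_posteriors(seq, priors, init, trans):
    """seq: list of feature-vector ids; priors[k] is the cluster prior,
    init[k] the initial-state distribution, trans[k] the transition table."""
    logp = []
    for k in range(len(priors)):
        lp = math.log(priors[k]) + math.log(init[k].get(seq[0], 1e-9))
        for a, b in zip(seq, seq[1:]):
            lp += math.log(trans[k].get((a, b), 1e-9))  # smoothed floor
        logp.append(lp)
    m = max(logp)                       # log-sum-exp for stability
    w = [math.exp(v - m) for v in logp]
    z = sum(w)
    return [v / z for v in w]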
27. Classification on Test Data
- Measurements of prediction performance:
- Accuracy
- Recall
28. Classification on Test Data
- Suppose the vector sequence is v = v1 v2 ... vp, where vi may not occur in the training data
- From the similarity matrix M, we get a list (say, the top K) of vectors similar to vi
- Use Vij to denote the top-K vectors similar to vi
29. Example of Similarity
- Given a new sequence v = <4 4 5 2 5 9 4>, the feature vector 9 is new
- The top three similar feature vectors are 4, 7, 1
- Hence, we have three candidate sequences (see the sketch below):
- <4 4 5 2 5 4 4>
- <4 4 5 2 5 7 4>
- <4 4 5 2 5 1 4>
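A sketch of candidate-sequence generation via the top-K similar vectors; the helper backed by the similarity matrix M is an assumed interface:

```python
# Sketch: replace each unseen feature vector with its top-K similar vectors.
from itertools import product

def candidate_sequences(seq, seen, top_k_similar, k=3):
    """top_k_similar(v, k) returns the k seen vectors most similar to v
    (an assumed helper backed by the similarity matrix M)."""
    options = [[v] if v in seen else top_k_similar(v, k) for v in seq]
    return [tuple(c) for c in product(*options)]

seen = {1, 2, 4, 5, 7}
sims = {9: [4, 7, 1]}                       # from the example above
top_k = lambda v, k: sims[v][:k]
print(candidate_sequences([4, 4, 5, 2, 5, 9, 4], seen, top_k))
# -> [(4,4,5,2,5,4,4), (4,4,5,2,5,7,4), (4,4,5,2,5,1,4)]
```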
30. Test Data: COMP102 at HKUST
- COMP102 (http://www.cs.ust.hk/liao/comp102/)
- Collected from 31 August 2002 to 17 December 2002; a C course for 1st-year undergraduates
- 281 different web pages
- These pages were created at different times
- 6,089 individual IP addresses visited
- 255,074 valid requests
- 60 megabytes in flat-text format
- 9,913 sessions altogether
- Average session length is 9.3
31. COMP102 Data (Clicks)
32. COMP102: Browsing Models
33. Test Accuracy with/without Similarity
- Training: first 42 days; testing: next 18 days
- w1 (category) >> w2 (keywords), w3 (link)
34. Test Recall with/without Similarity
- Training: first 42 days; testing: next 18 days
- w1 (category) >> w2 (keywords), w3 (link)
35. Test Accuracy with/without Similarity
- Training: every 10 days; testing: next 10 days
- w1 (category) > w2 (keywords) > w3 (link)
36. Test Recall with/without Similarity
- Training: every 10 days; testing: next 10 days
- w1 (category) > w2 (keywords) > w3 (link)
[Chart: results shown from the first 10 days through the last 10 days]
37. Conclusions and Future Work
- Knowledge-Poor
- Page-level Markov models
- Knowledge-Middle
- Ontological Markov models
- Knowledge-Rich
- Relational Markov models
- Future work: uncover implicit knowledge