Title: Weblog Mining: from Pages to Relations
1. Web-Log Mining: from Pages to Relations
- Qiang Yang
- HKUST,
- Hong Kong, China
2. A Spectrum of Web-Log Miners
- Knowledge Rich
- Have logical relations over page contents
- Database generated pages
- Can make accurate predictions even without data!
- Knowledge Middle
- Some hierarchical representations of ontology and content
- Most cases!
- Can predict based on similarity
- Knowledge Poor
- Have only page-level logs
- No relational knowledge
- Can predict observed pages only when data is plentiful
3. I. Knowledge-Poor: Web-Log Mining
Association Rule based Predictive Model
[Figure: example click sessions A,B,C,D / A,B,C,F / A,B,E / B,C / B,C,D,G / C,D. A window of current observations (size m) feeds two steps, (1) extract rules and (2) select rules, to predict pages in the window of prediction.]
4. Association Rules
- LHS → RHS
- RHS restricted to one URL, but can be relaxed to more than one URL
- RHS: the most popular URL with the same LHS
- For each W1, if a rule applies and its RHS is in W2, then Success! (see the sketch below)
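A minimal sketch of this predictive model, using the example sessions from the figure above; the window size, helper names, and session format are illustrative assumptions, not from the slides:

```python
# Sketch: extract LHS -> RHS association rules from sessions and predict.
from collections import Counter

def extract_rules(sessions, w1_size=2, min_support=2):
    """Count (LHS, RHS) pairs: LHS is a window of w1_size consecutive
    pages; RHS is any single page seen later in the same session."""
    counts = Counter()
    for s in sessions:
        for i in range(len(s) - w1_size):
            lhs = tuple(s[i:i + w1_size])
            for rhs in s[i + w1_size:]:
                counts[(lhs, rhs)] += 1
    # RHS restricted to one URL: keep the most popular RHS per LHS
    best = {}
    for (lhs, rhs), n in counts.items():
        if n >= min_support and n > best.get(lhs, (None, 0))[1]:
            best[lhs] = (rhs, n)
    return {lhs: rhs for lhs, (rhs, _) in best.items()}

def predict(rules, w1):
    """If a rule's LHS matches the current observation window, predict."""
    return rules.get(tuple(w1))

sessions = [list("ABCD"), list("ABCF"), list("ABE"),
            list("BC"), list("BCDG"), list("CD")]
rules = extract_rules(sessions)
print(predict(rules, ["A", "B"]))  # -> 'C' (support 2)
```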
5. Rule-Representation Methods (min sup = 2)
- Subset: A, C → C
- Substring: BC → C
- Latest Substring: C → C
- Subsequence
- Latest Subsequence (all five LHS forms are sketched below)
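A hedged sketch of how the five LHS representations could be enumerated from an observation window; the function names are illustrative and assume distinct page ids in the window:

```python
# Sketch: the five LHS forms for an observation window W1, e.g. ['B','C'].
from itertools import combinations

def subsets(w):            # Subset: any subset, order ignored
    return [c for r in range(1, len(w) + 1) for c in combinations(sorted(w), r)]

def substrings(w):         # Substring: contiguous runs
    return [tuple(w[i:j]) for i in range(len(w)) for j in range(i + 1, len(w) + 1)]

def latest_substrings(w):  # Latest Substring: runs ending at the latest click
    return [tuple(w[i:]) for i in range(len(w))]

def subsequences(w):       # Subsequence: order-preserving, gaps allowed
    return [c for r in range(1, len(w) + 1) for c in combinations(w, r)]

def latest_subsequences(w):  # Latest Subsequence: must include the latest click
    return [c for c in subsequences(w) if c[-1] == w[-1]]
```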
6. Rule-Selection Criteria
- Among the rules whose LHS matches W1:
- Longest-Match Selection
- Select the rule whose left-hand side is the longest
- Corresponds to using the strongest signature to predict
- Most-Confident Selection
- Select the rule with the highest confidence
- Pessimistic Selection (see the sketch below)
- UCF(E, N) is the upper bound on the estimated error for a given confidence value, assuming a normal distribution of errors
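A hedged sketch of pessimistic selection. UCF(E, N) is implemented here with the C4.5-style normal approximation to the binomial upper bound on the error rate; the default confidence factor 0.25 mirrors C4.5 and is an assumption, not from the slides:

```python
# Sketch: pessimistic rule selection via an upper confidence bound on error.
from collections import namedtuple
from statistics import NormalDist

Rule = namedtuple("Rule", "lhs rhs errors n")  # illustrative rule record

def ucf(e, n, cf=0.25):
    """Upper bound on the error rate given e errors in n applications,
    using the normal approximation (cf is the confidence level)."""
    z = NormalDist().inv_cdf(1 - cf)
    f = e / n
    return ((f + z * z / (2 * n)
             + z * (f / n - f * f / n + z * z / (4 * n * n)) ** 0.5)
            / (1 + z * z / n))

def pessimistic_select(matching_rules):
    """Among rules whose LHS matches W1, pick the lowest pessimistic error."""
    return min(matching_rules, key=lambda r: ucf(r.errors, r.n))
```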
7. Comparison Matrix
- Compared with C4.5 as well
- Comparison criteria: precision, model size
8. Integrating with Caching
- Cache replacement algorithm (sketched below)
- A key value K(p) is assigned to each object p
- Arlitt et al., USENIX 1998; Cao & Irani, '97
- K(p) = L + F(p) · C(p) / S(p)
- C(p): cost of loading a page (e.g., amount of time)
- S(p): size of a page
- F(p): frequency count of a page
- L: an inflation factor to reflect cache aging
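A minimal sketch of a GDSF-style cache using this key; the class shape and eviction loop are illustrative assumptions:

```python
# Sketch: GDSF-style replacement with K(p) = L + F(p) * C(p) / S(p).
class GDSFCache:
    def __init__(self, capacity_bytes):
        self.capacity, self.used, self.L = capacity_bytes, 0, 0.0
        self.entries = {}                 # url -> (freq, cost, size)

    def _key(self, url):
        f, c, s = self.entries[url]
        return self.L + f * c / s

    def access(self, url, cost, size):
        if url in self.entries:           # hit: bump frequency
            f, _, _ = self.entries[url]
            self.entries[url] = (f + 1, cost, size)
            return True
        while self.used + size > self.capacity and self.entries:
            victim = min(self.entries, key=self._key)
            self.L = self._key(victim)    # aging: inflate L to evicted key
            self.used -= self.entries.pop(victim)[2]
        self.entries[url] = (1, cost, size)
        self.used += size
        return False                      # miss
```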
9. Predicting Future Frequency
Per-session predicted probabilities:
- Session 1: O1 = 0.70, O2 = 0.90, O3 = 0.30, O4 = 0.11
- Session 2: O1 = 0.60, O2 = 0.70, O3 = 0.20, O5 = 0.42
- Session 3: O1 = 0.70, O2 = 0.90, O4 = 0.30, O5 = 0.33
Predicted (future) frequency:
- W1 = 0.70 + 0.60 + 0.70 = 2.00
- W2 = 0.90 + 0.70 + 0.90 = 2.50
- W3 = 0.30 + 0.20 = 0.50
- W4 = 0.11 + 0.30 = 0.41
- W5 = 0.42 + 0.33 = 0.75
- Ki = L + (Wi + Fi) · Ci / Si (see the sketch below)
- Wi: future frequency
- Fi: past frequency
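A sketch reproducing the worked example: the future frequency W_i sums the per-session predicted probabilities, then feeds the extended key. Assuming the key combines W_i and F_i additively (the operator is garbled in the source):

```python
# Sketch: future frequency W_i and the extended key K_i.
from collections import defaultdict

def future_frequency(session_predictions):
    """session_predictions: one {object: predicted probability} per session."""
    w = defaultdict(float)
    for preds in session_predictions:
        for obj, p in preds.items():
            w[obj] += p
    return dict(w)

sessions = [{"O1": 0.70, "O2": 0.90, "O3": 0.30, "O4": 0.11},
            {"O1": 0.60, "O2": 0.70, "O3": 0.20, "O5": 0.42},
            {"O1": 0.70, "O2": 0.90, "O4": 0.30, "O5": 0.33}]
W = future_frequency(sessions)  # W1=2.00 W2=2.50 W3=0.50 W4=0.41 W5=0.75

def key(L, w_i, f_i, c_i, s_i):
    return L + (w_i + f_i) * c_i / s_i   # assumed '+' between W_i and F_i
```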
10. Byte-Hit Rate (measures bandwidth reduction)
- Byte hit rate = (bytes answered by the cache) / (total bytes)
11. Relative Network Traffic (NASA)
[Figures: relative network traffic, prefetch vs. no-prefetch, and fractional latency on the NASA trace]
12. II. Knowledge-Rich: The Other Extreme
- Web-log mining
- User sessions
- Markov models
- But sometimes data about specific pages are sparse!
- Cannot train the Markov models properly
- A single visitor views ~0% of any site
- New dynamic content is not in the training data
- Now, many pages are generated automatically
- Deep web
- Dynamically generated pages
- Question: if we have the relational knowledge, what more can we do?
13. Relational Markov Models
- RMM (Relational Markov Model)
- Group pages of the same type into relations
- Combine low-level and high-level information
- Automatically adapt web sites for different users
- What a relation means:
- Buys(Student, PC)
- Student(ID, Name, Addr)
- Relational algebra is the basis for relational databases
14. Relational Markov Models (Anderson et al., KDD '02)
- Domains often contain relational structure
- Each state is a tuple, in the relational-database sense
- Structure enables state generalization
- Which allows learning from sparse data
15. Relational Markov Models
16. Relational Markov Model
17. RMM Generalization
- Want to estimate P(s → d), but no data!
- Use shrinkage
- Can do this with abstractions of d and s
- Let σ be an abstraction of s and δ an abstraction of d
- P(s → d) = Σ_i λ_i · P_MLE(σ_i → δ_i), summed over abstraction pairs (σ_i, δ_i) of (s, d)
18. Learning and Inference
- Maximum use of available information
- P(s → d) = Σ_i λ_i · P_MLE(σ_i → δ_i) (see the sketch below)
- The λ_i are non-negative coefficients that sum to 1, taken over all abstractions of the source and destination
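A hedged sketch of the shrinkage estimate; the representation of abstraction pairs and counts is an illustrative assumption:

```python
# Sketch: shrinkage over abstraction pairs of (s, d).
def shrinkage_estimate(s, d, abstraction_pairs, counts, lambdas):
    """P(s -> d) = sum_i lambda_i * P_MLE(sigma_i -> delta_i).
    abstraction_pairs(s, d) yields (sigma_i, delta_i), most specific first;
    counts[(sigma, delta)] holds observed transition counts."""
    p = 0.0
    for lam, (sigma, delta) in zip(lambdas, abstraction_pairs(s, d)):
        n_sd = counts.get((sigma, delta), 0)
        n_s = sum(n for (a, _), n in counts.items() if a == sigma)
        if n_s > 0:
            p += lam * n_sd / n_s         # lambda_i * maximum-likelihood est.
    return p
```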
19. How to Get the λ's
- Intuitively, a λ should be large when:
- Its abstractions are more specific
- Training data is abundant
- RMM-uniform
- Easy and fast, but poor results
- RMM-EM
- Poor results
- RMM-rank
- Slower than the Markov model, but generally fast
- Requires I(q_s) · I(q_d); when the abstraction hierarchy is large, computation is slow
20. Adaptive Web Navigation
- Anderson's algorithm
- Personalize web sites based on a person's browsing pattern (add a link, rearrange list items, etc.)
- Step 1: mine the web-server log to build models (RMMs) of users
- Step 2: adapt the site for the user
- Features:
- No training data on some pages
- Periodically changing pages
- Solution to sparse training data:
- Identify semantic correspondence between pages, both visited and unseen
21. Evaluation (Anderson, Weld, Domingos '02)
- Gazelle: www.gazelle.com, an e-commerce site
- www.cs.washington.edu/education/courses/
- Generate good relational structure
- Computation time: RMM vs. Markov model
22. III. Knowledge-Middle: Most Cases
- No relational knowledge
- But there is ontological knowledge
- Still have the sparse-data problem
- However:
- Most web sites have some ontological structure
- We can build Markov models based on these structures
23. Vector Representation of Web Pages
- A feature vector v_i is defined by a set of features: v_i = (f_1, f_2, ..., f_l)
- where P is the whole page space and V is the whole feature-vector space
- e.g., v_1 = <paths, keyword set, <out-linking vector>>
24. The Similarity Function
- Sim(v_i, v_j) = Σ_k w_k · S(f_ik, f_jk) (see the sketch below)
- where w_k is the weight of the k-th feature, and S(f_ik, f_jk) is the similarity of two features in position k
- Features:
- a) different paths
- b) different keywords
- c) different out-links
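A minimal sketch of this weighted similarity, assuming Jaccard similarity as the per-feature measure S (the slides do not specify it):

```python
# Sketch: Sim(v_i, v_j) = sum_k w_k * S(f_ik, f_jk).
def feature_sim(a, b):
    """Jaccard similarity of two feature values treated as sets (assumed)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def page_sim(vi, vj, weights):
    """vi, vj: feature vectors, e.g. (path tokens, keyword set, out-links)."""
    return sum(w * feature_sim(fi, fj) for w, fi, fj in zip(weights, vi, vj))

v1 = (["cs", "comp102", "notes"], {"loop", "array"}, {"/comp102/hw1.html"})
v2 = (["cs", "comp102", "slides"], {"loop", "pointer"}, {"/comp102/hw1.html"})
print(page_sim(v1, v2, weights=(0.6, 0.3, 0.1)))
```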
25. Preprocessing → Vector Space
26. Markov-Model-Based Clustering Algorithm
- Model-based clustering algorithm (Step 2 is sketched below)
- Step 1: cluster the feature-vector sequences into K clusters by mixture Markov models (EM)
- Step 2: for each new sequence s, calculate the probabilities that it belongs to the K clusters
- Difference from Cadez '00:
- We cluster the feature-vector sequences instead of pages
- Goal: classification with unseen data
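A compressed sketch of Step 2 only: given K first-order Markov models already fit by EM, compute cluster-membership probabilities for a new sequence. The parameter layout is an assumption, and the EM fit itself is omitted:

```python
# Sketch: P(cluster k | sequence) under K fitted Markov models.
import math

def cluster_posteriors(seq, priors, init, trans):
    """seq: list of feature-vector ids; priors[k] is the cluster prior,
    init[k] the initial-state distribution, trans[k] the transition table."""
    logp = []
    for k in range(len(priors)):
        lp = math.log(priors[k]) + math.log(init[k].get(seq[0], 1e-9))
        for a, b in zip(seq, seq[1:]):
            lp += math.log(trans[k].get((a, b), 1e-9))  # smoothed floor
        logp.append(lp)
    m = max(logp)                       # log-sum-exp for stability
    w = [math.exp(v - m) for v in logp]
    z = sum(w)
    return [v / z for v in w]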
27. Classification on Test Data
- Measurements of prediction performance:
- Accuracy
- Recall
28. Classification on Test Data
- Suppose the vector sequence is v = v1 v2 ... vp, where vi may not occur in the training data
- From the similarity matrix M, we get a list (say, the top K) of vectors similar to vi
- Use Vij to denote the top-K vectors similar to vi
29. Example of Similarity
- Given a new sequence v = <4 4 5 2 5 9 4>, the feature vector 9 is new
- The top three similar feature vectors are 4, 7, 1
- Hence, we have three candidate sequences (see the sketch below):
- <4 4 5 2 5 4 4>
- <4 4 5 2 5 7 4>
- <4 4 5 2 5 1 4>
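A sketch of candidate-sequence generation via the top-K similar vectors; the helper backed by the similarity matrix M is an assumed interface:

```python
# Sketch: replace each unseen feature vector with its top-K similar vectors.
from itertools import product

def candidate_sequences(seq, seen, top_k_similar, k=3):
    """top_k_similar(v, k) returns the k seen vectors most similar to v
    (an assumed helper backed by the similarity matrix M)."""
    options = [[v] if v in seen else top_k_similar(v, k) for v in seq]
    return [tuple(c) for c in product(*options)]

seen = {1, 2, 4, 5, 7}
sims = {9: [4, 7, 1]}                       # from the example above
top_k = lambda v, k: sims[v][:k]
print(candidate_sequences([4, 4, 5, 2, 5, 9, 4], seen, top_k))
# -> [(4,4,5,2,5,4,4), (4,4,5,2,5,7,4), (4,4,5,2,5,1,4)]
```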
30. Test Data: COMP102 at HKUST
- COMP102 (http://www.cs.ust.hk/liao/comp102/)
- Collected from 31 August 2002 to 17 December 2002; a C course for 1st-year undergraduates
- 281 different web pages
- These pages were created at different times
- 6,089 individual IP addresses visited
- 255,074 valid requests
- 60 megabytes in flat-text format
- 9,913 sessions altogether
- Average session length is 9.3
31. COMP102 Data (Clicks)
32. COMP102: Browsing Models
33. Test Accuracy with/without Similarity
- Training: first 42 days; testing: next 18 days
- w1 (category) >> w2 (keywords), w3 (link)
34. Test Recall with/without Similarity
- Training: first 42 days; testing: next 18 days
- w1 (category) >> w2 (keywords), w3 (link)
35. Test Accuracy with/without Similarity
- Training: every 10 days; testing: next 10 days
- w1 (category) > w2 (keywords) > w3 (link)
36. Test Recall with/without Similarity
- Training: every 10 days; testing: next 10 days
- w1 (category) > w2 (keywords) > w3 (link)
[Chart: results shown from the first 10 days through the last 10 days]
37. Conclusions and Future Work
- Knowledge-Poor
- Page-level Markov models
- Knowledge-Middle
- Ontological Markov models
- Knowledge-Rich
- Relational Markov models
- Future work: uncover implicit knowledge