Title: Web Analytics
1Web Analytics
- Xuejiao Liu
- INF 385F WIRED
- Fall 2004
2Outline
- Introduction
- What is Web Analytics
- Why Web Analytics matters
- Secondary readings
- Log files analysis
- Web usage mining
- Data preparation
- KDD process
- Document access in repositories
3Log File Lowdown (Michael Calore, 2001)
- Log file
- What is in a log file
- Traffic
- Audience
- Browsers/Platforms
- Errors
- Referers
4Log File Lowdown
- Sample Log File
- adsl-63-183-164.ilm.bellsouth.net - - [09/May/2001:13:42:07 -0700] "GET /about.htm HTTP/1.1" 200 3741 "http://www.e-angelica.com" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"
- Log File Analyzers
- WebTrends, Sawmill, Analog, Webalizer, HTTP-analyze
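To make the fields of a log entry concrete, here is a minimal Python sketch that splits one line into its parts. It assumes the sample entry is in the common "combined" log format, with the brackets, colons, and semicolons restored; the pattern `COMBINED` and the helper `parse_line` are illustrative names, not part of any analyzer listed above.

```python
import re
from datetime import datetime

# Assumed layout (combined log format): host, identity, user, [timestamp],
# "request line", status, size, "referer", "user agent".
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# The sample entry from the slide, with punctuation restored (an assumption).
SAMPLE = (
    'adsl-63-183-164.ilm.bellsouth.net - - [09/May/2001:13:42:07 -0700] '
    '"GET /about.htm HTTP/1.1" 200 3741 '
    '"http://www.e-angelica.com" '
    '"Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"'
)

record = parse_line(SAMPLE)
print(record["host"], record["path"], record["status"], record["agent"])
```

Each named group maps to one of the categories above: `host` to audience, `path` and `size` to traffic, `status` to errors, `referer` to referers, and `agent` to browsers and platforms.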
5WebTrends
- Log file analyzer
- Advantages
- Fast and effective
- User-friendly interface
- Feature-rich
- Support different operating systems
- Disadvantages
- Not free
6WebTrends
7The KDD Process for Extracting Useful Knowledge from Volumes of Data (Fayyad, U., G. Piatetsky-Shapiro, et al. 1996)
- KDD: Knowledge Discovery in Databases
- The value of data
- Definitions
- KDD
- Data mining
8The KDD Process
The KDD process
1. Creating a target dataset
2. Preprocessing and data cleaning
3. Data reduction and projection
4. Data mining
- Choosing the data mining function
- Choosing the data mining algorithm
5. Interpretation and evaluation
9The KDD Process
- Data Mining
- Data mining involves fitting models to or determining patterns from observed data
- Data mining algorithms
- The model
- The preference criterion
- The search algorithm
10The KDD Process
- Data Mining
- Model functions
- Classification
- Regression
- Clustering
- Dependency modeling
- Link analysis
- Goals of Data Mining
- Predictive and descriptive
11Data Preparation for Mining World Wide Web Browsing Patterns (Cooley, R. W., B. Mobasher, et al. 1999)
- Web Usage Mining vs. data mining
- The WEBMINER process
- Preprocessing
- Mining algorithms
- Pattern Analysis
12Data Preparation
- Preprocessing
- Data cleaning
- User identification
- Session identification
- Path completion
- Formatting
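The first three preprocessing steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the WEBMINER implementation: image requests are treated as noise, a user is approximated by the pair (IP address, user agent), and a gap of more than 30 minutes starts a new session. The names `clean`, `identify_sessions`, and `SESSION_TIMEOUT`, and the sample hits, are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)       # assumed inactivity threshold
NOISE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css")  # assumed noise

def clean(hits):
    """Data cleaning: drop requests for embedded files such as images."""
    return [h for h in hits if not h[3].lower().endswith(NOISE_SUFFIXES)]

def identify_sessions(hits, timeout=SESSION_TIMEOUT):
    """Group hits (ip, agent, timestamp, path) into per-user sessions."""
    by_user = defaultdict(list)
    for ip, agent, ts, path in hits:
        by_user[(ip, agent)].append((ts, path))   # user identification heuristic
    sessions = []
    for user, visits in by_user.items():
        visits.sort()                              # order each user's hits by time
        current = [visits[0]]
        for prev, visit in zip(visits, visits[1:]):
            if visit[0] - prev[0] > timeout:       # long gap: close the session
                sessions.append((user, current))
                current = []
            current.append(visit)
        sessions.append((user, current))
    return sessions

# Invented example hits, for illustration only.
t0 = datetime(2001, 5, 9, 13, 0, 0)
raw_hits = [
    ("10.0.0.1", "MSIE", t0, "/index.htm"),
    ("10.0.0.1", "MSIE", t0 + timedelta(minutes=1), "/logo.gif"),
    ("10.0.0.1", "MSIE", t0 + timedelta(minutes=5), "/about.htm"),
    ("10.0.0.1", "MSIE", t0 + timedelta(hours=2), "/index.htm"),
    ("10.0.0.1", "Netscape", t0 + timedelta(minutes=3), "/index.htm"),
]

sessions = identify_sessions(clean(raw_hits))
for user, visits in sessions:
    print(user, [path for _, path in visits])
```

Path completion and formatting are left out of the sketch because they depend on the site's link structure and on the input format of the mining algorithm that follows.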
13Data Preparation
14Data Preparation
15Tracking the Growth of a Site (Nielsen, Jakob, 1998)
- Exponential growth of the web and the internet
- Statistical method
- Logarithmic conversion to get a linear regression
- Statistical analysis
- Hypothesis: the site is growing (number of pageviews and date are correlated)
- R² and significance
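The statistical method above can be sketched as follows: take the logarithm of the pageview counts, fit an ordinary least-squares line against the date, and report R². The function name `fit_exponential_growth` and the pageview figures are made up for illustration; they are not data from the article.

```python
import math

def fit_exponential_growth(days, pageviews):
    """Least-squares fit of log10(pageviews) against day number.

    Returns (slope, intercept, r_squared) of the fitted line.
    """
    ys = [math.log10(v) for v in pageviews]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Invented monthly samples: day number and pageviews on that day.
days = [0, 30, 60, 90, 120, 150]
pageviews = [1000, 1900, 4100, 8300, 15500, 33000]

slope, intercept, r_squared = fit_exponential_growth(days, pageviews)
doubling_days = math.log10(2) / slope   # days for traffic to double
print(f"R^2 = {r_squared:.3f}, doubling time = {doubling_days:.1f} days")
```

A straight line on the log scale corresponds to exponential growth on the original scale, so the slope translates directly into a growth rate or a doubling time.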
16Tracking the Growth of a Site
R2 0.96, p
17Tracking the Growth of a Site
- Predict growth rate
- Clean noise
- Confidence interval
18Predicting Document Access in Large, Multimedia Repositories (Recker, M. R. and J. E. Pitkow, 1996)
- Patterns of document requests in network-accessible multimedia databases
- Main idea
- Two related domains: human memory and libraries
- Borrow models and research results from them
19Predicting Document Access
- The model: human memory (Anderson and Schooler)
- The relationship of recency and performance is a power function
- The relationship of frequency and performance is a power function
- Two parameters for performance
- Need probability p and need odds p/(1-p)
- The linear function
- Log(Need odds) = a Log(Frequency) + b
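The step from a power function to the linear function can be written out as a short derivation. The symbols O (need odds), F (frequency), and the scale constant B are introduced here for illustration only; a and b are the slope and intercept of the linear function on the slide.

```latex
\begin{align}
O &= \frac{p}{1-p} && \text{need odds from need probability } p \\
O &= B\,F^{a} && \text{power function of frequency} \\
\log O &= a \log F + b, \qquad b = \log B && \text{linear on log-log axes}
\end{align}
```

The same steps apply to recency, with the elapsed time in place of F.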
20Predicting Document Access
- Apply the human memory analysis to a model of document requests
- Dataset: log file of Georgia Tech WWW repository
- A dynamic information ecology
- Frequency analysis
- Regression equation
- Log(Need Odds) = .99 Log(Frequency) + 1.30
- Recency analysis
- Regression equation
- Log(Need Odds) = -1.15 Log(days) + .41
- Combining recency and frequency
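One way to obtain the points behind a frequency regression like the one above is to count, for each document, how often it was requested during a window of days and whether it was requested again on the following day. The sketch below assumes a 7-day window and a next-day test; the slides do not state the procedure, and the function name `need_odds_by_frequency` and the access data are invented for illustration.

```python
from collections import defaultdict

def need_odds_by_frequency(access_days, window_start, window_len=7):
    """Estimate need probability and need odds for each frequency value.

    access_days maps a document to the set of day numbers on which it was
    requested. Frequency is the number of window days with a request; a
    document counts as "needed" if it is requested on the day after the window.
    """
    window = range(window_start, window_start + window_len)
    test_day = window_start + window_len
    totals = defaultdict(int)
    needed = defaultdict(int)
    for doc, days in access_days.items():
        freq = sum(1 for d in window if d in days)
        if freq == 0:
            continue                      # never seen in the window
        totals[freq] += 1
        if test_day in days:
            needed[freq] += 1
    result = {}
    for freq, total in sorted(totals.items()):
        p = needed[freq] / total
        odds = p / (1 - p) if p < 1 else float("inf")
        result[freq] = (p, odds)
    return result

# Invented access history: document -> days on which it was requested.
access_days = {
    "/a.html": {1, 2, 3, 4, 5, 6, 7, 8},
    "/b.html": {1, 3, 8},
    "/c.html": {2, 5},
    "/d.html": {4},
    "/e.html": {6, 8},
    "/f.html": {9},
}

result = need_odds_by_frequency(access_days, window_start=1)
for freq, (p, odds) in result.items():
    print(freq, p, odds)
```

Sliding the window across the whole log and averaging the estimates would give one (frequency, need odds) point per frequency value; plotting the logarithms of those points is what makes a linear fit possible.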
21Predicting Document Access
- Conclusion
- Recency and frequency of past document access are strong predictors of future document access
- Recency proved to be a stronger predictor than frequency
- Applications for the design of information systems
- Determine optimal ordering of retrieved items
- Inform design decisions
- Design of caching algorithms