Web Usage Mining presentation

1
Web Usage Mining

Chris Yang

2
Three Phases of Web Usage Mining

Discover usage patterns from Web data to
understand and better serve the needs of
Web-based applications (Srivastava et al., 2000)
Three phases
Preprocessing
Pattern discovery
Pattern analysis

4
Motivation of Web Usage Mining

Bring vendor and end customer in electronic
commerce closer
Mass customization
Vendor may personalize his product message for
individual customers at a massive scale

5
Data Sources

Server
Web server log explicitly records the browsing
behavior of site visitors and reflects the access
of a Web site by multiple users
Formats
Common log
Extended log
Web log may not be completely reliable
Cached files are served at the client and are
never requested from the server
Information passed through the POST method is not
available in a server log
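
As a rough illustration of the common log format, the sketch below parses one log line into fields. It assumes the usual NCSA layout (host, ident, user, timestamp, request line, status, size); the sample line and the names CLF_PATTERN and parse_clf_line are made up for this example, and the extended log format is not covered.

```python
import re
from datetime import datetime

# Common log format: host ident authuser [timestamp] "request line" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Parse one common-log-format line into a dict, or return None if malformed."""
    m = CLF_PATTERN.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    # Split the request line into method, URL, and protocol version.
    method, _, rest = rec["request"].partition(" ")
    url, _, protocol = rest.rpartition(" ")
    rec["method"], rec["url"], rec["protocol"] = method, url, protocol
    return rec

# Illustrative log line (invented for this sketch).
sample = '192.0.2.7 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
record = parse_clf_line(sample)
```

Note that a page served from a client cache produces no such line at all, which is one reason the log is incomplete.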

6
HTTP

The Web's RPC on top of TCP/IP
It is stateless: the server keeps no memory of
earlier requests, and in HTTP/1.0 a separate
connection is made for every request
Simple to implement, yet incurs overhead
Each HTTP client/server interaction consists of
a single request/reply interchange
HTTP request
HTTP response

HTTP request message consists of
request line
method or command to apply to a server resource
e.g. GET, POST
URL (without protocol and server domain name)
the protocol version used by the client, e.g.
HTTP/1.0
request header fields
Pass additional information about the request and
the client itself to the server - much like RPC
parameters
Each header field consists of a name, followed by
a colon (:) and the field value
the entity body (optional)
Clients use it to pass bulk information to the
server (CGI)

Examples of HTTP methods
GET - retrieve the specified URL
POST - send this data to the specified URL
Examples of HTTP header fields
Accept - lists acceptable MIME type/subtype
contents
User-Agent - provides client browser information
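
The request layout described above (request line, header fields, optional entity body) can be sketched as follows. This is a minimal illustration, not a full HTTP implementation; the helper names build_request and parse_request, the URL, and the header values are assumptions made for the example.

```python
CRLF = "\r\n"

def build_request(method, url, headers, body=""):
    """Assemble an HTTP/1.0 request: request line, header fields, blank line, body."""
    lines = [f"{method} {url} HTTP/1.0"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    return CRLF.join(lines) + CRLF + CRLF + body

def parse_request(message):
    """Split a request message back into its parts."""
    head, _, body = message.partition(CRLF + CRLF)
    request_line, *header_lines = head.split(CRLF)
    method, url, version = request_line.split(" ")
    headers = {}
    for line in header_lines:
        # Each header field is "name: value".
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return method, url, version, headers, body

# Hypothetical GET request with the two header fields named above.
msg = build_request("GET", "/path/file.html",
                    {"Accept": "text/html", "User-Agent": "ExampleBrowser/1.0"})
method, url, version, headers, body = parse_request(msg)
```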

Note: crlf = carriage-return/line-feed
8

HTTP response message
response header line
HTTP version, the status of the response, and an
explanation of the returned status
response header fields
Information that describes the server's
attributes and the returned HTML document to
client
entity body
Contains an HTML document that a client has
requested
Each HTML document needs a separate request
message
stateless

The result code 200 indicates that the request is
successful.
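
A small sketch of reading the response header line: it splits the line into version, status code, and explanation, and treats any 2xx code as success. The function name parse_status_line and the sample line are illustrative assumptions.

```python
def parse_status_line(line):
    """Split a response header line into (version, status code, reason phrase)."""
    version, code, reason = line.strip().split(" ", 2)
    return version, int(code), reason

version, code, reason = parse_status_line("HTTP/1.0 200 OK\r\n")
is_success = 200 <= code < 300  # 200 is the usual "request successful" code
```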

9
Data Source - Server

Web server log in extended log format

10
Data Source - Server

Packet sniffing
Monitor network traffic coming to a Web server
Extract usage data directly from TCP/IP packets
Cookies
Tokens generated by the Web server for individual
client browsers to automatically track the site
visitor
HTTP protocol is stateless which makes tracking
individual users difficult
Cookies rely on implicit user cooperation
Query data
CGI scripts
URI for CGI programs may contain additional
parameter values to be passed to CGI applications
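
To illustrate query data, the sketch below pulls the parameter values out of a request URI using Python's standard urllib.parse module. The CGI path and parameter names are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical request URI for a CGI program; path and parameters are made up.
uri = "/cgi-bin/search.cgi?category=music&q=jazz&page=2"

parts = urlsplit(uri)
params = parse_qs(parts.query)  # maps each parameter name to a list of values
```

Parameters sent with GET appear in the logged URI like this; values sent in a POST body do not.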

11
Data Source - Client

Client
Remote agent (e.g. JavaScript or Java applets)
Modifying the source code of an existing browser
to enhance data collection capabilities
Difficulty - Requires client cooperation to enable
JavaScript and Java applets, or the voluntary use
of the modified browsers

12
Data Source - Proxy

Proxy
Caching between client browsers and Web servers
Proxy traces may reveal the actual HTTP request
from multiple clients to multiple Web servers
It helps characterize the browsing behavior of a
group of anonymous users sharing a common proxy
server

13
Data Abstractions

Data from server, client and proxy helps us to
construct data abstractions
Users, server sessions, episodes, click-streams,
and page views
W3C Web Characterization Activity (WCA) has
drafted Web term definitions relevant to Web
usage (http://www.w3.org/WCA)
User: a single individual who is accessing files
from one or more Web servers through a browser
Difficulty in identifying a user: a user may access
the Web through different machines or use more
than one agent on a single machine
Page view: every file that contributes to the
display on a user's browser at one time
Includes several files such as frames, graphics,
and scripts
When a user downloads a Web page by clicking an
anchor text or submitting a URL, he/she is not
aware of how many frames, graphics, images, or
scripts he/she is receiving
Click-stream: a sequential series of page view
requests
Server may not have all the information needed to
obtain the click-stream
Page views served from a client or proxy-level
cache are not available at the server
User session: the click-stream of page views for
a single user across the entire Web
In practice, only the portion of a user session
that accesses a particular site can be
identified.
Server session: the set of page views in a user
session for a particular Web site
Episode: any semantically meaningful subset of a
user or server session

14
Phase 1: Preprocessing

Usage Preprocessing
Due to the incompleteness of available data,
usage preprocessing is a difficult task
Typical problems
Unless client-side tracking is used, only the IP
address, agent, and server-side click-stream are
available
Single IP address / Multiple server sessions
Internet service providers (ISPs) have a pool of
proxy servers
A proxy server may have several users accessing a
Web site, potentially over the same time period
Multiple IP addresses / Single server session
Some ISPs or privacy tools randomly assign each
request from a user to one of several IP
addresses
Multiple IP addresses / Single user
A user accesses the Web from different machines
(a different IP address from session to session)
Multiple agents / Single user
A user who uses more than one browser appears as
multiple users
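
One common heuristic, sketched here under the assumption that only IP address and agent are available, is to treat each distinct (IP address, agent) pair as one user. The record fields and the function name group_by_ip_and_agent are invented for this example.

```python
from collections import defaultdict

def group_by_ip_and_agent(requests):
    """Group request records by (IP address, user agent) as a stand-in for users."""
    users = defaultdict(list)
    for req in requests:
        users[(req["ip"], req["agent"])].append(req["url"])
    return dict(users)

# Made-up records: two different browsers behind one proxy IP address.
requests = [
    {"ip": "192.0.2.1", "agent": "BrowserA/4.0", "url": "/index.html"},
    {"ip": "192.0.2.1", "agent": "BrowserB/5.0", "url": "/index.html"},
    {"ip": "192.0.2.1", "agent": "BrowserA/4.0", "url": "/products.html"},
]
users = group_by_ip_and_agent(requests)
```

The heuristic separates two people behind one proxy only if their agents differ, and it still miscounts in the cases listed above: one person using two browsers shows up as two users.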

15
Usage Preprocessing

Segmenting click-stream into sessions
It is difficult to know when a user leaves a Web
site
A thirty-minute timeout is often used (Catledge
and Pitkow, 1995)
In some cases, a session ID is embedded in each
URI and the session is defined by the content
server
Content from user action
Content servers maintain state variables for each
active session, so the information needed to
determine the content served for a user request
is not always available
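
The thirty-minute timeout rule can be sketched as below: a new session starts whenever the gap between two consecutive clicks of the same user exceeds the timeout. The function name split_sessions and the sample clicks are illustrative assumptions; the clicks are assumed to be already sorted by time and to belong to one user.

```python
from datetime import datetime, timedelta

def split_sessions(clicks, timeout=timedelta(minutes=30)):
    """Split one user's time-ordered (timestamp, url) clicks into sessions.

    A new session starts when the gap since the previous click exceeds `timeout`.
    """
    sessions = []
    previous = None
    for timestamp, url in clicks:
        if previous is None or timestamp - previous > timeout:
            sessions.append([])
        sessions[-1].append(url)
        previous = timestamp
    return sessions

t0 = datetime(2000, 1, 1, 9, 0)
clicks = [
    (t0, "/index.html"),
    (t0 + timedelta(minutes=5), "/products.html"),
    (t0 + timedelta(minutes=50), "/index.html"),   # 45-minute gap: new session
    (t0 + timedelta(minutes=55), "/contact.html"),
]
sessions = split_sessions(clicks)
```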

Using referrer and agent information, 4 sessions
are determined

17
Content Preprocessing and Structure Preprocessing

Content Preprocessing
Converting the text, image, scripts, and other
multimedia files into forms that are useful for
Web usage mining
Classification
By content
By intended use (Cooley et al., 1999; Pirolli et
al., 1996)
Convey information, gather information from user,
allow navigation, or combination
Structure Preprocessing
Hyperlinks between page views

18
Phase 2: Pattern Discovery

Statistical Analysis
Perform descriptive statistical analysis (such as
mean, median, frequency etc.) on page views,
viewing time and length of a navigational path
from session file
Web traffic analysis tools produce periodic
reports
Most frequently accessed pages
Average view time of a page
Average length of a path through a site
Useful for improving the system performance,
enhancing the security of the system,
facilitating the site modification task, and
providing support for marketing decisions
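
The three report items above can be computed from a session file with a few lines of standard-library Python. The session data below is made up, and the representation (a list of (page, viewing time in seconds) pairs per session) is an assumption of this sketch.

```python
from collections import Counter
from statistics import mean, median

# Made-up session file: each session is a list of (page, viewing time in seconds).
sessions = [
    [("/index.html", 10), ("/products.html", 40), ("/order.html", 60)],
    [("/index.html", 20), ("/products.html", 30)],
    [("/index.html", 30)],
]

# Most frequently accessed page.
page_counts = Counter(page for s in sessions for page, _ in s)
most_frequent = page_counts.most_common(1)[0]

# Average view time of each page.
view_times = {}
for s in sessions:
    for page, seconds in s:
        view_times.setdefault(page, []).append(seconds)
avg_view_time = {page: mean(times) for page, times in view_times.items()}

# Average and median length of a path through the site.
path_lengths = [len(s) for s in sessions]
avg_path_length = mean(path_lengths)
median_path_length = median(path_lengths)
```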

Association Rules
Relate pages that are most often referenced
together in a single server session
Sets of pages that are accessed together with a
support value exceeding some specified threshold
These pages may not be directly connected by
hyperlinks
Useful for Web designers to restructure their Web
sites
These rules serve as a heuristic for prefetching
documents in order to reduce user-perceived
latency when loading a page from a remote site
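
As a simplified illustration of the support threshold, the sketch below counts how often each pair of pages occurs in the same server session and keeps the pairs whose support reaches a minimum value. It only handles pairs, not the larger page sets a full association-rule miner would find; the function name and sample sessions are invented.

```python
from collections import Counter
from itertools import combinations

def frequent_page_pairs(sessions, min_support):
    """Return page pairs whose support (fraction of sessions containing both)
    is at least `min_support`."""
    counts = Counter()
    for session in sessions:
        # Count each pair once per session, regardless of repeat visits.
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    n = len(sessions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

sessions = [
    ["/index.html", "/music.html", "/order.html"],
    ["/index.html", "/music.html"],
    ["/index.html", "/books.html"],
    ["/music.html", "/order.html"],
]
pairs = frequent_page_pairs(sessions, min_support=0.5)
```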

Clustering
Group together a set of items having similar
characteristics
Usage clusters
Establish groups of users exhibiting similar
browsing patterns
Useful for inferring user demographics in order
to perform market segmentation
Page clusters
Discover groups of pages that have related
content
Useful for search engines and Web assistance
providers

Classification
Mapping a data item into one of several
predefined classes
Develop a profile of users belonging to a
particular class or category
Requires feature extraction and selection that
best describe the properties of a given class or
category
Techniques
Decision tree classifiers, naïve Bayesian
classifier, k-nearest neighbor classifiers,
support vector machines, etc.
E.g.
30% of users who place online orders in
/Product/Music are in the 19-25 age group and
live on the West coast

Sequential Pattern
Find inter-session patterns
The presence of a set of items is followed by
another item in a time-ordered set of sessions or
episodes
Useful for predicting future patterns in order to
place advertisements for certain user groups
Temporal analysis
Trend analysis, change point detection, or
similarity analysis

Dependency Modeling
Develop a model capable of representing
significant dependencies among the various
variables in the Web domain
E.g.
A model representing the different stages a
visitor undergoes while shopping in an online
store based on the action chosen (from casual
visitor to a serious potential buyer)
Techniques
Hidden Markov models, Bayesian belief networks

24
Phase 3: Pattern Analysis

Filter out uninteresting rules or patterns from
the set found in the pattern discovery phase

25
Major Application Areas for Web Usage Mining
(Srivastava et al., 2000)
26
Architecture of the WebSIFT system (Cooley et
al., 1999)
27
WUM: Web Usage Miner
Navigation behavior in Web sites (Berendt and
Spiliopoulou, 2000)

A Web site is a network of structurally or
semantically interrelated nodes (built in a way
that reflects the designer's intuition).
Quality of a Web site
The conformance of the Web site's structure to
the intuition of each group of visitors accessing
the site.
Intuition of visitors is indirectly reflected in
their navigation behavior (represented in the
browsing pattern)
Measure of the quality of Web site
Quality of service (e.g. response time)
Quality of navigation
Accessibility
Information utility
Ease of use
Attractiveness of the presentation metaphor

28
Sequence Mining

Sequence mining supports the discovery of
frequent paths composed of not necessarily
adjacent pages
Given a collection of transactions ordered in
time (each transaction contains a set of items),
discover sequences of maximal length with support
above a given threshold
A sequence is an ordered list of elements, an
element being a set of items appearing together
in a transaction
Elements need not be adjacent in time but their
ordering in a sequence must not violate the time
ordering of the support transactions
Example
Consider a Web site with pages W, A, B, C, D, E
and a link from W to D
WABC (1000 times), WDBC (100 times), WABDEC (400
times)
Frequency threshold: 25%
The sequence WD appears 500 (400 + 100) times
(33% of 1500 paths), which is above the threshold
In the above example, the link from W to D is
used in only 1 out of 5 of these cases.
Therefore, sequence mining is not useful in
understanding the usefulness of a hyperlink.
In WUM, a navigation pattern is a directed
acyclic graph composed of a group of sequences
that conform to a template
The purpose is to determine which links are
responsible, through their usage, for the
frequency of sequences
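
The arithmetic in the W-to-D example on this slide can be checked with a short sketch. It counts the paths in which D follows W (not necessarily adjacent) and, separately, the paths that actually use the direct link. The helper name contains_in_order is an assumption of this example.

```python
def contains_in_order(path, pages):
    """True if `pages` occur in `path` in order, not necessarily adjacent."""
    it = iter(path)
    return all(page in it for page in pages)

# Path frequencies taken from the example on this slide.
paths = {"WABC": 1000, "WDBC": 100, "WABDEC": 400}
total = sum(paths.values())

wd_any = sum(n for p, n in paths.items() if contains_in_order(p, "WD"))
wd_direct = sum(n for p, n in paths.items() if "WD" in p)

support = wd_any / total          # share of paths in which D follows W
link_share = wd_direct / wd_any   # share of those that used the W-to-D link
```

The sequence passes the 25% threshold, yet the direct link accounts for only a fifth of its occurrences, which is the gap that WUM's navigation patterns are meant to expose.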

29
WUM Navigation Sequences and Navigation
Patterns

A session is a directed list of page accesses
performed by a user during his/her visit in a
site
A navigation pattern is a structure that
Emphasizes the common parts among the sessions
Does not purge the dissimilar parts
Annotates both common and non-common parts with
quantitative information
P is a set of Web pages in the site
If the site is dynamic in nature, P is the set of
all pages that can be generated
D is a dataset of sessions
A session is a directed list of elements from P
A sequence of length n is a vector s ∈ (P × N)^n
(N is the set of positive integers)
U = P × N
Example
P = {a, b, c, d, e, f, g, h}
ab, ac, abcde, bcbf, abdfhe are sessions
appearing in D

30
Generalized sequences

wildcard [low; high] is matched by any sequence
of elements that has length at least low and at
most high (low ≥ 0, high ≥ low)
wildcard *: its range is not of interest
A generalized sequence g is a vector
g1 · g2 · … · gn
The number of non-wildcard elements in g is the
length of g, length(g)
Example
(a,1) · (b,1) [2;4] (e,1) matches Sessions 3
and 5
The group of sequences that match g constitutes
the navigation pattern of g, navp(g)
The hits of g, hits(g), is the number of
sequences that are matched by g.
confidence(gi, gj, g) = hits(g1 · … · gi-1 · gi) /
hits(g1 · … · gj)
g = (a,1) · (b,1) [2;4] (e,1)
hits(g) = 30 + 10 = 40
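
The matching of a generalized sequence against sessions can be sketched as follows, using the five example sessions ab, ac, abcde, bcbf, abdfhe. This is an illustrative matcher, not WUM's implementation: the names Wildcard, to_sequence, matches_at, and matches are invented, and the sketch assumes a session matches when some contiguous stretch of it fits the template.

```python
from collections import namedtuple

Wildcard = namedtuple("Wildcard", ["low", "high"])

def to_sequence(session):
    """Turn a page string like 'bcbf' into [(page, occurrence number), ...]."""
    seen = {}
    seq = []
    for page in session:
        seen[page] = seen.get(page, 0) + 1
        seq.append((page, seen[page]))
    return seq

def matches_at(template, seq, start):
    """True if `template` matches `seq` beginning exactly at index `start`."""
    if not template:
        return True
    head, rest = template[0], template[1:]
    if isinstance(head, Wildcard):
        # Try every allowed number of skipped elements.
        return any(
            matches_at(rest, seq, start + skip)
            for skip in range(head.low, head.high + 1)
            if start + skip <= len(seq)
        )
    return start < len(seq) and seq[start] == head and matches_at(rest, seq, start + 1)

def matches(template, session):
    """True if the template fits the session starting at any position."""
    seq = to_sequence(session)
    return any(matches_at(template, seq, i) for i in range(len(seq)))

sessions = ["ab", "ac", "abcde", "bcbf", "abdfhe"]
g = [("a", 1), ("b", 1), Wildcard(2, 4), ("e", 1)]
matched = [i + 1 for i, s in enumerate(sessions) if matches(g, s)]
```

Here matched holds the 1-based numbers of the sessions that g matches. Computing hits(g) would additionally need the frequency of each session, which lives in the aggregate log rather than in this list.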

31
Aggregate tree and log

navp(g) is modeled as a tree structure (aggregate
tree)
Aggregate log

32
Discover navigation pattern

A template is a vector comprised of variable
ranging over the domain U and of wildcards
A mining query is a template declaration
accompanied by a conjunction of constraints on
the permissible values of the template variables
Example
NODE AS x y z
TEMPLATE x · y [2;4] z AS t
WHERE x.support ≥ 85
AND (y.support / x.support) ≥ 0.8
