Title: Our Topic:
1Our Topic: Web Usage Mining
Presented by Wenzhen Xing and Kun Gao
Under the guidance of Dr. Bettina Berendt
For the seminar Web Mining
2This is a dynamic and fast changing world!
3Introduction and Background
- More and more organizations rely on the Internet and the World Wide Web to conduct business.
- They generate and collect large volumes of data in their daily operations.
4- These data are generally gathered automatically by web servers and collected in server or access logs.
- Mining and analyzing these logs can provide valuable information, e.g. for targeting ads to specific groups of users.
5- Web mining is the application of data mining
techniques to large web data repositories.
6The Goals of Web Mining also include the
improvement of web design and structure, and
generation of dynamic recommendations.
(Session 1)
7Overview
- Web Mining
- Web Usage Mining
- Data Source
- Three phases of Web Usage Mining
- Preprocessing
- Pattern Discovery
- Pattern Analysis
- Applications of related software
- Conclusion
8Taxonomy of Web Mining
- Web Mining
  - Web Content Mining
    - Agent-based Approach
      - Intelligent Search Agents
      - Info. Filtering/Categorization
      - Personalized Web Agents
    - Database Approach
      - Multilevel Databases
      - Web Query Systems
  - Web Usage Mining
    - Data Integration
    - Transaction Identification
    - Pattern Discovery Tools
    - Pattern Analysis Tools
9Knowledge Discovery in Databases
DATA -> [selection] -> Target Data -> [preprocessing] -> Preprocessed Data
-> [transformation] -> Transformed Data -> [data mining] -> Patterns
-> [interpretation] -> KNOWLEDGE
10- Classification of web data
- Content data: any complete or synthetic representation of the resource (the real data), such as HTML documents, images, sound files, etc.
- Structure data: data describing the structure and organization of the content through internal tags (intra-page) or hyperlinks (inter-page).
- User profile data: demographic information derived from registration.
- Usage data: data that describes the pattern of usage of Web pages, such as IP addresses, page references, and the date and time of accesses.
11- Data Sources
- Server level collection: the server stores data about the requests performed by the client, so the data generally concern just one source.
- Client level collection: the client itself sends information about the user's behaviour to a repository. This can be implemented with a remote agent (such as JavaScript or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities.
- Proxy level collection: information is stored at the proxy side, so the Web data cover several Web sites, but only for users whose Web clients pass through the proxy.
12Data Sources
13Web Server Log
(Session 2)
14Web Server Access Logs
- Typical Data in a Server Access Log
looney.cs.umn.edu han - 09/Aug/1996:09:53:52 -0500 "GET mobasher/courses/cs5106/cs5106l1.html HTTP/1.0" 200
mega.cs.umn.edu njain - 09/Aug/1996:09:53:52 -0500 "GET / HTTP/1.0" 200 3291
mega.cs.umn.edu njain - 09/Aug/1996:09:53:53 -0500 "GET /images/backgnds/paper.gif HTTP/1.0" 200 3014
mega.cs.umn.edu njain - 09/Aug/1996:09:54:12 -0500 "GET /cgi-bin/Count.cgi?dfCS home.dat\ddC\ft1 HTTP
mega.cs.umn.edu njain - 09/Aug/1996:09:54:18 -0500 "GET advisor HTTP/1.0" 302
mega.cs.umn.edu njain - 09/Aug/1996:09:54:19 -0500 "GET advisor/ HTTP/1.0" 200 487
looney.cs.umn.edu han - 09/Aug/1996:09:54:28 -0500 "GET mobasher/courses/cs5106/cs5106l2.html HTTP/1.0" 200
...
- Access Log Format
- IP address, userid, time, method, url, protocol, status, size
- Example: mega.cs.umn.edu njain 09/Aug/1996:09:54:31 advisor/csci-faq.html
- Other Server Logs: referrer logs, agent logs
15- client IP address or hostname
- user id ('-' if anonymous)
- access time
- HTTP request method (GET, POST, HEAD, ...)
- path of the resource on the Web server (identifying the URL)
- the protocol used for the transmission (HTTP/1.0, HTTP/1.1)
- the status code returned by the server as response (200 for OK, 404 for not found, ...)
- the number of bytes transmitted
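The fields listed above can be pulled out of a log line with a regular expression. The following is a minimal sketch, assuming an entry laid out like the samples (host, user id, a dash, timestamp written as 09/Aug/1996:09:53:52, zone offset, quoted request, status, optional size). The names `LOG_PATTERN` and `parse_entry` are hypothetical and do not come from any tool mentioned in the talk.

```python
import re
from datetime import datetime

# Assumed layout: host userid - timestamp zone "method url protocol" status [size]
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<userid>\S+) \S+ '
    r'(?P<time>\S+) (?P<zone>[+-]\d{4}) '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3})(?: (?P<size>\d+))?'
)

def parse_entry(line):
    """Return a dict of the log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S")
    entry["status"] = int(entry["status"])
    # The size is missing in some entries, so it stays None there.
    entry["size"] = int(entry["size"]) if entry["size"] else None
    return entry

sample = 'mega.cs.umn.edu njain - 09/Aug/1996:09:53:52 -0500 "GET / HTTP/1.0" 200 3291'
parsed = parse_entry(sample)
print(parsed["host"], parsed["method"], parsed["url"], parsed["status"], parsed["size"])
```

Lines that do not follow the assumed layout (for example a truncated request) simply yield `None`, which lets a preprocessing step count and skip them.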
16Three Phases
17Three Phases
- Preprocessing
- Pattern discovery
- Pattern analysis
18Preprocessing
- Converts the raw usage data into the data abstractions needed for pattern discovery.
- The most difficult task in Web usage mining, because the available data are incomplete.
19Usage Preprocessing
- A single proxy server may have several users accessing a Web site, potentially over the same time period.
- Single IP Address/Multiple Users
  - AOL Effect
  - ISP Proxy Servers
  - Public Access Machines
20Usage Preprocessing(cont.)
- Multiple IP Addresses/Single User: a user who accesses the Web from different machines will have a different IP address from session to session. This makes tracking repeat visits from the same user difficult.
21Usage Preprocessing(cont.)
- Multiple IP Addresses/Single Server Session: some ISPs or privacy tools randomly assign each request from a user to one of several IP addresses. In this case, a single server session can have multiple IP addresses.
- Multiple Agents/Single User: likewise, a user who uses more than one browser, even on the same machine, will appear as multiple users.
22Usage Preprocessing(cont.)
- Solutions to the problem
- Cookies: a small piece of data that is saved on the client machine
  - Advantages: track the same user across multiple sessions.
  - Disadvantages: can be declined or deleted; privacy concerns.
- User login: require the user to log in with an ID and password
  - Advantages: unique ID tied to an individual, not to a machine or browser.
  - Disadvantages: not all users are willing to register.
23Usage Preprocessing(cont.)
- Solutions to the problem (cont.)
- Embedded session ID
  - Advantages: can't be turned off.
  - Disadvantages: can't track repeat visits; loses the first file access of each session.
- Client-side tracking (modified browser)
  - Advantages: clean, accurate source of usage data.
  - Disadvantages: privacy concerns; can only track a small percentage of the user population.
24Usage Preprocessing(cont.)
- Other methods
- Use session time-outs
- Path completion to infer cached references
- Example: expanding a session A -> B -> C by an access pair (B -> D) results in A -> B -> C -> B -> D
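Both heuristics can be sketched in a few lines. The code below is only an illustration under stated assumptions: the 30-minute time-out is an assumed value, the function names `split_by_timeout` and `complete_path` are hypothetical, and the path completion simply retraces the recorded path backwards, which is a simplification of how a real browser history behaves.

```python
from datetime import datetime, timedelta

def split_by_timeout(requests, timeout=timedelta(minutes=30)):
    """Split one user's time-ordered (time, url) requests into sessions.

    A new session starts whenever the gap to the previous request
    exceeds `timeout` (30 minutes is an assumed default).
    """
    sessions = []
    for time, url in requests:
        if not sessions or time - sessions[-1][-1][0] > timeout:
            sessions.append([])
        sessions[-1].append((time, url))
    return sessions

def complete_path(path, referrer, page):
    """Append `page`, inserting the cached back-steps that lead to `referrer`."""
    completed = list(path)
    if completed and referrer != completed[-1] and referrer in completed:
        # Index of the most recent visit to the referrer.
        last_seen = len(completed) - 1 - completed[::-1].index(referrer)
        # Retrace the pages between the current page and the referrer.
        completed.extend(reversed(path[last_seen:-1]))
    completed.append(page)
    return completed

print(complete_path(["A", "B", "C"], "B", "D"))  # ['A', 'B', 'C', 'B', 'D']
```

The inserted B is a page view that never reached the server because the browser served it from its cache, which is exactly the kind of reference path completion tries to recover.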
25Content Preprocessing
- Converting the text, images, scripts, and other files such as multimedia into forms that are useful for the Web Usage Mining process.
- This consists of performing content mining such as classification or clustering (also found in pattern discovery).
26Pattern Discovery
- Pattern discovery draws upon methods and
algorithms developed from several fields such as
statistics, data mining, machine learning and
pattern recognition.
27Pattern Discovery
- Statistics
- Association Rules
- Clustering
- Classification
- Sequential Patterns
- Path Analysis
- etc...
28Pattern Discovery(cont.)
- Statistics
- Most common method.
- This kind of analysis is performed by many tools. Its aim is to describe the traffic on a Web site, e.g. most visited pages, average daily hits, etc.
- Useful for improving system performance, enhancing system security, facilitating the site modification task, etc.
(Session 3 and 5)
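As a small illustration of such descriptive statistics, the sketch below counts the most visited page and the average number of daily hits. The `hits` list is made-up sample data, not taken from the log shown earlier.

```python
from collections import Counter
from datetime import date

# Hypothetical (day, url) page requests.
hits = [
    (date(1996, 8, 9), "/"),
    (date(1996, 8, 9), "/advisor/"),
    (date(1996, 8, 9), "/"),
    (date(1996, 8, 10), "/"),
]

page_counts = Counter(url for _, url in hits)    # hits per page
hits_per_day = Counter(day for day, _ in hits)   # hits per day

most_visited = page_counts.most_common(1)        # [("/", 3)]
average_daily_hits = sum(hits_per_day.values()) / len(hits_per_day)

print(most_visited, average_daily_hits)
```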
29Pattern Discovery (cont.)
- Association rules
- The main idea is to treat every URL requested by a user in a visit as an item of basket data, and to discover relationships between these items that meet a minimum support level.
- Discovers the correlations among references to various pages of a Web site within a single server session.
- Useful for restructuring the Web site and as a heuristic for pre-fetching documents to reduce latency.
30 Association Rules (cont.)
- Discovers affinities among sets of items across transactions
- X => Y, where X, Y are sets of items; each rule is qualified by its confidence and its support
- Examples
  - 60% of clients who accessed /products/ also accessed /products/software/webminer.htm.
  - 30% of clients who accessed /special-offer.html placed an online order in /products/software/.
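Support and confidence for a rule X => Y can be computed directly from a set of sessions. The sketch below uses the usual definitions (support is the fraction of sessions containing all items; confidence is the support of X and Y together divided by the support of X). The session data are invented so that the first example rule comes out at 60%.

```python
def support(sessions, items):
    """Fraction of sessions that contain every item in `items`."""
    items = set(items)
    return sum(items <= session for session in sessions) / len(sessions)

def confidence(sessions, x, y):
    """Estimated probability of seeing `y` in a session that contains `x`."""
    support_x = support(sessions, x)
    if support_x == 0:
        return 0.0  # the rule never applies
    return support(sessions, set(x) | set(y)) / support_x

# Hypothetical sessions, each a set of requested URLs.
sessions = [
    {"/products/", "/products/software/webminer.htm"},
    {"/products/", "/products/software/webminer.htm"},
    {"/products/", "/products/software/webminer.htm"},
    {"/products/"},
    {"/products/", "/special-offer.html"},
]

rule_confidence = confidence(
    sessions, {"/products/"}, {"/products/software/webminer.htm"}
)
print(rule_confidence)
```

A rule is normally reported only if both values pass user-chosen minimum thresholds, which keeps rare coincidences out of the result.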
31Pattern Discovery (cont.)
- Clustering
- Meaningful clusters of URLs can be created by discovering similar characteristics between them according to users' behaviors.
- Usage clusters
  - Useful to perform market segmentation in E-commerce or to provide personalized Web content to the users.
- Page clusters
  - Useful for Internet search engines and Web assistance providers.
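One simple way to form page clusters is to group pages that are visited in largely the same sessions. The sketch below is an assumption-laden illustration, not the method of any particular tool: it measures overlap with the Jaccard coefficient, merges pages single-link style above an assumed threshold of 0.5, and the function names are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Overlap of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_pages(sessions, threshold=0.5):
    """Group pages whose sets of visiting sessions overlap strongly."""
    visitors = {}
    for session_id, pages in enumerate(sessions):
        for page in pages:
            visitors.setdefault(page, set()).add(session_id)
    # Start with one cluster per page, then merge similar pairs.
    cluster_of = {page: {page} for page in visitors}
    for p, q in combinations(sorted(visitors), 2):
        similar = jaccard(visitors[p], visitors[q]) >= threshold
        if similar and cluster_of[p] is not cluster_of[q]:
            merged = cluster_of[p] | cluster_of[q]
            for page in merged:
                cluster_of[page] = merged
    unique = {frozenset(cluster) for cluster in cluster_of.values()}
    return sorted(sorted(cluster) for cluster in unique)

# Hypothetical sessions, each a set of visited pages.
sessions = [{"/a", "/b"}, {"/a", "/b"}, {"/c", "/d"}, {"/c", "/d", "/a"}]
print(cluster_pages(sessions))  # [['/a', '/b'], ['/c', '/d']]
```

Usage clusters can be built the same way by swapping the roles of pages and sessions, so that users with similar page sets end up together.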
32Pattern Discovery (cont.)
- Classification
- Develops a profile of users belonging to a particular class or category.
- Requires extraction and selection of the features that best describe the properties of a given class or category.
33Pattern Discovery (cont.)
- Clustering and Classification
- Clients who often access /products/software/webminer.html tend to be from educational institutions.
- Clients who placed an online order for software tend to be students in the 20-25 age group and live in the United States.
- 75% of clients who download software from /products/software/demos/ visit between 7:00 and 11:00 pm on weekends.
34Pattern Discovery (cont.)
- Sequential Patterns
- Finds inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes.
- Useful for predicting the future behavior of clients.
- The technique tries to discover time-ordered sequences of URLs followed by past users in order to predict future ones (widely used for Web advertising).
35- Sequential Patterns
- 30% of clients who visited /products/software/ had done a search in Yahoo using the keyword "software" before their visit.
- 60% of clients who placed an online order for WEBMINER placed another online order for software within 15 days.
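Percentages like these can be estimated by checking, for each client's time-ordered sequence, whether one item is later followed by another. This is a minimal sketch with invented data. The helper names are hypothetical, and it ignores time windows such as "within 15 days".

```python
def follows(sequence, first, then):
    """True if `then` occurs somewhere after an occurrence of `first`."""
    if first not in sequence:
        return False
    return then in sequence[sequence.index(first) + 1:]

def sequential_confidence(sequences, first, then):
    """Share of sequences containing `first` in which `then` comes later."""
    with_first = [s for s in sequences if first in s]
    if not with_first:
        return 0.0
    return sum(follows(s, first, then) for s in with_first) / len(with_first)

# Hypothetical time-ordered page sequences, one per client.
sequences = [
    ["/products/", "/order"],
    ["/products/", "/faq", "/order"],
    ["/order", "/products/"],
    ["/faq"],
]

share = sequential_confidence(sequences, "/products/", "/order")
print(round(share, 2))  # 0.67
```

Unlike an association rule, the order matters here: the third client requested both pages but in the opposite order, so that sequence does not count.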
36Pattern Discovery (cont.)
- Path Analysis
- Types of Path/Usage Information
  - Most frequent paths traversed by users
  - Entry and exit points
  - Distribution of user session durations / user attrition
- Examples
  - 60% of clients who accessed /home/products/file1.html followed the path /home -> /home/whatsnew -> /home/products -> /home/products/file1.html.
  - (Olympics Web site) 30% of clients who accessed sport-specific pages started from the Sneakpeek page.
  - 65% of clients left the site after 4 or fewer references.
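The path statistics named above reduce to simple counting once sessions are available as ordered page lists. The following sketch uses invented sessions and is meant only to show the bookkeeping.

```python
from collections import Counter

# Hypothetical sessions, each an ordered list of requested pages.
sessions = [
    ["/home", "/home/whatsnew", "/home/products"],
    ["/home", "/home/products"],
    ["/home", "/home/whatsnew", "/home/products"],
    ["/home/products"],
    ["/home", "/a", "/b", "/c", "/d"],
]

path_counts = Counter(tuple(s) for s in sessions)   # frequency of each full path
entry_points = Counter(s[0] for s in sessions)      # first page of each session
exit_points = Counter(s[-1] for s in sessions)      # last page of each session

# Share of sessions that ended after 4 or fewer references.
short_share = sum(len(s) <= 4 for s in sessions) / len(sessions)

print(path_counts.most_common(1))
print(entry_points.most_common(1), exit_points.most_common(1), short_share)
```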
37Data and Transaction Model for Association Rules
- Let L be a set of server access log entries. A log entry l ∈ L has the following components:
  - The IP address of the client, denoted l.ip
  - The user id of the client, denoted l.uid
  - The URL of the page accessed by the client, denoted l.url
  - The time of access, denoted l.time
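The log entry described above maps naturally onto a small record type. The sketch below is an illustration only: the class name `LogEntry` and the helper `group_by_client` are hypothetical, and grouping by (ip, uid) is shown as a plausible first step, not as the definition of a transaction from the slides.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class LogEntry:
    ip: str         # l.ip
    uid: str        # l.uid
    url: str        # l.url
    time: datetime  # l.time

def group_by_client(entries):
    """Collect each client's (url, time) pairs, keyed by (ip, uid), in time order."""
    grouped = defaultdict(list)
    for entry in sorted(entries, key=lambda e: e.time):
        grouped[(entry.ip, entry.uid)].append((entry.url, entry.time))
    return dict(grouped)

# Hypothetical entries.
entries = [
    LogEntry("mega.cs.umn.edu", "njain", "/advisor/", datetime(1996, 8, 9, 9, 54, 19)),
    LogEntry("looney.cs.umn.edu", "han", "/courses/", datetime(1996, 8, 9, 9, 53, 52)),
    LogEntry("mega.cs.umn.edu", "njain", "/", datetime(1996, 8, 9, 9, 53, 52)),
]
by_client = group_by_client(entries)
print([url for url, _ in by_client[("mega.cs.umn.edu", "njain")]])  # ['/', '/advisor/']
```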
38Data and Transaction Model for Association Rules
- Definition 1 An association transaction t is a
triple -
39Data and Transaction Model for Association Rules
40Data and Transaction Model for Association Rules
41Data and Transaction Model for Sequential Rules
42Data and Transaction Model for Sequential Rules
43Application of Association rules
44Application of Association rules
45Application of Sequential pattern rules
46Application of Sequential pattern rules
47Experimental Results
48Experimental Results
49Example: Session Inference with Referrer Log
#  IP           Time      URL  Referrer  Agent
1  www.aol.com  08:30:00  A    -         Mozilla/2.0 AIX 4.1.4
2  www.aol.com  08:30:01  B    E         Mozilla/2.0 AIX 4.1.4
3  www.aol.com  08:30:02  C    B         Mozilla/2.0 AIX 4.1.4
4  www.aol.com  08:30:01  B    -         Mozilla/2.0 Win 95
5  www.aol.com  08:30:03  C    B         Mozilla/2.0 Win 95
6  www.aol.com  08:30:04  F    -         Mozilla/2.0 Win 95
7  www.aol.com  08:30:04  B    A         Mozilla/2.0 AIX 4.1.4
8  www.aol.com  08:30:05  G    B         Mozilla/2.0 AIX 4.1.4
Identified Sessions
S1: (no referrer) -> A -> B -> G, from references 1, 7, 8
S2: E -> B -> C, from references 2, 3
S3: (no referrer) -> B -> C, from references 4, 5
S4: (no referrer) -> F, from reference 6
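One heuristic that reproduces the four sessions of this example is sketched below. It is an assumed reading of the slide, not necessarily the rule the original authors used: requests with the same IP and agent are treated as one user, a request joins that user's most recently updated session that already contains its referrer, and otherwise it opens a new session. The name `infer_sessions` is hypothetical.

```python
from collections import defaultdict

AIX = "Mozilla/2.0 AIX 4.1.4"
WIN = "Mozilla/2.0 Win 95"

# (reference number, IP, time, URL, referrer, agent); "" means no referrer.
LOG = [
    (1, "www.aol.com", "08:30:00", "A", "", AIX),
    (2, "www.aol.com", "08:30:01", "B", "E", AIX),
    (3, "www.aol.com", "08:30:02", "C", "B", AIX),
    (4, "www.aol.com", "08:30:01", "B", "", WIN),
    (5, "www.aol.com", "08:30:03", "C", "B", WIN),
    (6, "www.aol.com", "08:30:04", "F", "", WIN),
    (7, "www.aol.com", "08:30:04", "B", "A", AIX),
    (8, "www.aol.com", "08:30:05", "G", "B", AIX),
]

def infer_sessions(log):
    """Assign each request (processed in log order) to a session."""
    by_user = defaultdict(list)   # (ip, agent) -> sessions, most recent last
    all_sessions = []
    for ref, ip, _time, url, referrer, agent in log:
        user_sessions = by_user[(ip, agent)]
        target = None
        if referrer:
            # Look for the most recently updated session containing the referrer.
            for session in reversed(user_sessions):
                if referrer in session["pages"]:
                    target = session
                    break
        if target is None:
            # No matching session: start a new one, seeded with the referrer if any.
            target = {"pages": [referrer] if referrer else [], "refs": []}
            all_sessions.append(target)
        else:
            user_sessions.remove(target)
        target["pages"].append(url)
        target["refs"].append(ref)
        user_sessions.append(target)  # keep the most recently updated session last
    return all_sessions

for number, session in enumerate(infer_sessions(LOG), start=1):
    print(f"S{number}", " -> ".join(session["pages"]), session["refs"])
```

The agent field is what separates references 4 to 6 from the rest even though all eight requests share one IP address, and the referrer field is what splits the AIX requests into two interleaved sessions.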
50Applications for Web-Based Organizations
- Electronic Commerce
  - determine lifetime value of clients
  - design cross-marketing strategies across products
  - evaluate promotional campaigns
  - target electronic ads and coupons at user groups based on their access patterns
  - predict user behavior based on previously learned rules and users' profiles
  - present dynamic information to users based on their interests and profiles
51Other Applications
- Effective and Efficient Web Presence
  - determine the best way to structure the Web site
  - identify weak links for elimination or enhancement
  - a site-specific Web design agent
  - prefetch files that are most likely to be accessed
- Intra-Organizational Applications
  - enhance workgroup management and communication
  - evaluate Intranet effectiveness and identify structural needs and requirements