1
Our Topic: Web Usage Mining
Presented by: Wenzhen Xing & Kun Gao
With Guide of: Dr. Bettina Berendt
For seminar: Web Mining
2
This is a dynamic and fast changing world!
3
Introduction and Background

More and more organizations rely on the Internet
and the World Wide Web to conduct business.
They generate and collect large volumes of data in
their daily operations.

These data are generally gathered automatically
by web servers and collected in server or access
logs.
Mining and analyzing these logs can provide
valuable information, e.g. for targeting ads to
specific groups of users.

Web mining is the application of data mining
techniques to large web data repositories.

6
The Goals of Web Mining also include the
improvement of web design and structure, and
generation of dynamic recommendations.
(Session 1)
7
Overview

Web Mining
Web Usage Mining
Data Source
Three phases of Web Usage Mining
Preprocessing
Pattern Discovery
Pattern Analysis
Applications of related software
Conclusion

8
Taxonomy of Web Mining
Web Mining
  Web Content Mining
    Agent-based Approach
      Intelligent Search Agents
      Info. Filtering/Categorization
      Personalized Web Agents
    Database Approach
      Multilevel Databases
      Web Query Systems
  Web Usage Mining
    Data Integration
    Transaction Identification
    Pattern Discovery Tools
    Pattern Analysis Tools

9
Knowledge Discovery in Databases
DATA
  -> selection -> Target Data
  -> preprocessing -> Preprocessed Data
  -> transformation -> Transformed Data
  -> data mining -> Patterns
  -> interpretation -> KNOWLEDGE
10

Classification of web data
Content data: any complete or synthetic
representation of the resource (the real data),
such as HTML documents, images, sound files, etc.
Structure data: data describing the structure
and the organization of the content through
internal tags (intra-page) or hyperlinks
(inter-page).
User profile data: demographic information
derived from registration.
Usage data: data that describe the pattern of
usage of Web pages, such as IP addresses, page
references, and the date and time of accesses.

Data Sources
Server level collection: the server stores data
regarding the requests performed by clients, so
the data generally concern just one source.
Client level collection: the client itself sends
information about the user's behaviour to a
repository. This can be implemented with a remote
agent (such as JavaScript or Java applets) or by
modifying the source code of an existing browser
(such as Mosaic or Mozilla) to enhance its data
collection capabilities.
Proxy level collection: information is stored
at the proxy side, so the data cover several
Web sites, but only those users whose Web clients
pass through the proxy.

12
Data Sources
13
Web Server Log
(Session 2)
14
Web Server Access Logs

Typical Data in a Server Access Log

looney.cs.umn.edu han - [09/Aug/1996:09:53:52 -0500] "GET mobasher/courses/cs5106/cs5106l1.html HTTP/1.0" 200
mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291
mega.cs.umn.edu njain - [09/Aug/1996:09:53:53 -0500] "GET /images/backgnds/paper.gif HTTP/1.0" 200 3014
mega.cs.umn.edu njain - [09/Aug/1996:09:54:12 -0500] "GET /cgi-bin/Count.cgi?dfCS home.dat\ddC\ft1 HTTP
mega.cs.umn.edu njain - [09/Aug/1996:09:54:18 -0500] "GET advisor HTTP/1.0" 302
mega.cs.umn.edu njain - [09/Aug/1996:09:54:19 -0500] "GET advisor/ HTTP/1.0" 200 487
looney.cs.umn.edu han - [09/Aug/1996:09:54:28 -0500] "GET mobasher/courses/cs5106/cs5106l2.html HTTP/1.0" 200
. . .

Access Log Format
IP address | userid | time | method | url | protocol | status | size
Example: IP address mega.cs.umn.edu, userid njain,
time 09/Aug/1996:09:54:31, url advisor/csci-faq.html

Other Server Logs: referrer logs, agent logs

15
client IP address or hostname
user id ('-' if anonymous)
access time
HTTP request method (GET, POST, HEAD, ...)
path of the resource on the Web server (identifying the URL)
the protocol used for the transmission (HTTP/1.0, HTTP/1.1)
the status code returned by the server as response (200 for OK, 404 for not found, ...)
the number of bytes transmitted
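
The field list above can be turned into a small parser. The following is a minimal sketch, assuming the log lines use the bracketed timestamp of the Common Log Format and the field order of the sample entries (host, user id, one ignored field, time, request, status, optional size). The names LINE_RE and parse_line are hypothetical and chosen only for this illustration.

```python
import re
from datetime import datetime

# Assumed layout: host userid <ignored> [time] "METHOD url PROTOCOL" status [size]
LINE_RE = re.compile(
    r'^(?P<host>\S+) (?P<userid>\S+) \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) (?P<protocol>HTTP/\d\.\d)" '
    r'(?P<status>\d{3})(?: (?P<size>\d+|-))?$'
)


def parse_line(line):
    """Return a dict of access-log fields, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Convert the textual fields into typed values.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    size = record["size"]
    record["size"] = int(size) if size and size.isdigit() else None
    return record


sample = 'mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291'
entry = parse_line(sample)
```

Lines that are truncated or malformed simply yield None, so they can be counted and dropped during preprocessing rather than crashing the run.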
16
Three Phases
17
Three Phases

Preprocessing
Pattern discovery
Pattern analysis

18
Preprocessing

Convert the raw usage data into the data
abstractions required for pattern discovery.
This is the most difficult task in Web usage
mining because the available data are incomplete.

19
Usage Preprocessing

Single IP Address / Multiple Users - A single
proxy server may have several users accessing a
Web site, potentially over the same time period.
Examples:
AOL Effect
ISP Proxy Servers
Public Access Machines

20
Usage Preprocessing(cont.)

Multiple IP address/Single User - A user that
accesses the Web from different machines will
have a different IP address from session to
session. This makes tracking repeat visits from
the same user difficult.

21
Usage Preprocessing(cont.)

Multiple IP address/Single Server Session - Some
ISPs or privacy tools randomly assign each
request from a user to one of several IP
addresses. In this case, a single server session
can have multiple IP addresses.
Multiple Agent/Single User - Again, a user who
uses more than one browser, even on the same
machine, will appear as multiple users.

22
Usage Preprocessing(cont.)

Solutions to the problem
Cookies: a small piece of data that is saved on
the client machine
  Advantages: Track the same user across
  multiple sessions.
  Disadvantages: Can be declined or
  deleted. Privacy concerns.
User Login: require the user to log in with an ID
and password
  Advantages: Unique ID tied to an
  individual, not a machine or browser.
  Disadvantages: Not all users are willing to
  register.

23
Usage Preprocessing(cont.)

Solutions to the problem (cont.)
Embedded Session ID
  Advantages: Can't be turned off.
  Disadvantages: Can't track repeat visits.
  Loses the first file access of each session.
Client-side tracking (Modified Browser)
  Advantages: Clean, accurate source of
  usage data.
  Disadvantages: Privacy concerns. Can only
  track a small percentage of the user population.

24
Usage Preprocessing(cont.)

Other methods
Use session time-outs.
Use path completion to infer cached references.
Example: expanding a session A -> B -> C by an
access pair (B -> D) results in
A -> B -> C -> B -> D.
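
Both heuristics can be sketched in a few lines. This is an illustration under stated assumptions, not the presenters' implementation: the 30-minute timeout is an assumed default, timestamps are plain seconds, and the function names split_by_timeout and complete_path are hypothetical.

```python
def split_by_timeout(hits, timeout=1800):
    """Split one user's time-ordered (seconds, url) hits into sessions.

    A new session starts whenever the gap to the previous hit exceeds
    `timeout` seconds (1800 s is an assumed value, not from the slides).
    """
    sessions = []
    for t, url in hits:
        if sessions and t - sessions[-1][-1][0] <= timeout:
            sessions[-1].append((t, url))
        else:
            sessions.append([(t, url)])
    return sessions


def complete_path(session, access_pair):
    """Extend a session with an access pair (src, dst).

    If dst was reached from a page that is not the last page of the session,
    the user is assumed to have backtracked through cached pages, so the
    missing back-references are re-inserted before dst.
    """
    src, dst = access_pair
    path = list(session)
    if path and path[-1] != src and src in path:
        last_src = len(path) - 1 - path[::-1].index(src)
        # Pages revisited from the cache while going back to src.
        path.extend(reversed(path[last_src:-1]))
    path.append(dst)
    return path


completed = complete_path(["A", "B", "C"], ("B", "D"))
split = split_by_timeout([(0, "A"), (60, "B"), (4000, "C")])
```

With the slide's example, the session A -> B -> C and the access pair (B -> D) give A -> B -> C -> B -> D, because the return to B was served from the browser cache and never reached the server log.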

25
Content Preprocessing

Converting text, images, scripts, and other
files such as multimedia into forms that are
useful for the Web Usage Mining process.
This consists of performing content mining such
as classification or clustering (also found in
pattern discovery).

26
Pattern Discovery

Pattern discovery draws upon methods and
algorithms developed from several fields such as
statistics, data mining, machine learning and
pattern recognition.

27
Pattern Discovery

Statistics
Association Rules
Clustering
Classification
Sequential Patterns
Path Analysis
etc...

28
Pattern Discovery(cont.)

Statistics
The most common method.
This kind of analysis is performed by many tools.
Its aim is to describe the traffic on a Web site,
e.g. most visited pages, average daily hits, etc.
Useful for improving system performance,
enhancing the security of the system,
facilitating the site modification task, etc.
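
As a rough illustration of the descriptive statistics named above (most visited pages, average daily hits), the sketch below uses collections.Counter. The input format, a list of (day, url) pairs, and the function name traffic_summary are assumptions made for this example; the sample hits are invented.

```python
from collections import Counter


def traffic_summary(hits, top_n=3):
    """Return (most visited pages, average hits per day) for (day, url) pairs."""
    page_counts = Counter(url for _, url in hits)
    hits_per_day = Counter(day for day, _ in hits)
    # Average over the days that actually appear in the log.
    average = sum(hits_per_day.values()) / len(hits_per_day) if hits_per_day else 0.0
    return page_counts.most_common(top_n), average


# Invented sample data, for illustration only.
sample_hits = [
    ("1996-08-09", "/"),
    ("1996-08-09", "/advisor/"),
    ("1996-08-09", "/"),
    ("1996-08-10", "/"),
]
top_pages, avg_daily_hits = traffic_summary(sample_hits)
```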

(Session 3 and 5)
29
Pattern Discovery (cont.)

Association rules
The main idea is to treat every URL requested
by a user during a visit as an item of basket
data and to discover relationships between URLs
that meet a minimum support level.
Discovers correlations among references to
various pages of a web site within a single
server session.
Useful for restructuring a web site and as a
heuristic for pre-fetching documents to reduce
latency.

30
Association Rules (cont.)

Discovers affinities among sets of items across
transactions:
X => Y
where X, Y are sets of items; each rule has an
associated confidence and support.
Examples
60% of clients who accessed /products/ also
accessed /products/software/webminer.htm.
30% of clients who accessed /special-offer.html
placed an online order in /products/software/.
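
Support and confidence for a rule X => Y can be computed directly from a list of sessions. The sketch below uses the standard definitions (support: fraction of sessions containing both X and Y; confidence: fraction of sessions containing X that also contain Y). The function name rule_stats and the ten sample sessions are invented for illustration and are not taken from the presentation's data.

```python
def rule_stats(sessions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    x, y = set(antecedent), set(consequent)
    n_x = sum(1 for s in sessions if x <= set(s))
    n_xy = sum(1 for s in sessions if (x | y) <= set(s))
    support = n_xy / len(sessions) if sessions else 0.0
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Invented sessions: 5 visit /products/, 3 of those also visit the webminer page.
PRODUCTS = "/products/"
WEBMINER = "/products/software/webminer.htm"
sample_sessions = (
    [{PRODUCTS, WEBMINER}] * 3
    + [{PRODUCTS}] * 2
    + [{"/special-offer.html"}] * 5
)
support, confidence = rule_stats(sample_sessions, [PRODUCTS], [WEBMINER])
```

On this toy data the rule /products/ => webminer page has a confidence of 60%, matching the form of the first example, and a support of 30% of all sessions.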

31
Pattern Discovery (cont.)

Clustering
Meaningful clusters of URLs can be created by
discovering similar characteristics between them
according to users' behavior.
Usage clusters
Useful for market segmentation in E-commerce or
for providing personalized Web content to users.
Page clusters
Useful for Internet search engines and web
assistance providers.

32
Pattern Discovery (cont.)

Classification
Develop a profile of users belonging to a
particular class or category.
Requires extraction and selection of features that
best describe the properties of a given class or
category.

33
Pattern Discovery (cont.)

Clustering and Classification
Clients who often access
/products/software/webminer.html tend to be from
educational institutions.
Clients who placed an online order for software
tend to be students in the 20-25 age group and
live in the United States.
75% of clients who download software from
/products/software/demos/ visit between 7:00 and
11:00 pm on weekends.

34
Pattern Discovery (cont.)

Sequential Patterns
Find inter-session patterns such that the
presence of a set of items is followed by another
item in a time-ordered set of sessions or
episodes.
Useful for predicting the future behavior of
clients.
This technique attempts to discover time-ordered
sequences of URLs followed by past users in order
to predict future ones (it is widely used for Web
advertising).

Sequential Patterns
30% of clients who visited /products/software/
had done a search in Yahoo using the keyword
"software" before their visit.
60% of clients who placed an online order for
WEBMINER placed another online order for
software within 15 days.
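
A statement such as the second example can be checked by counting, per client, whether one event is followed by another within a time window. This is a simplified sketch: it only considers the first occurrence of the leading event, measures time in whole days, and uses invented event labels and client histories. The name sequential_confidence is hypothetical.

```python
def sequential_confidence(histories, first, then, max_gap_days=None):
    """Fraction of clients with `first` who later have `then`.

    histories maps a client id to a list of (day, event) pairs.
    Only the earliest occurrence of `first` is used (a simplification).
    """
    n_first = n_followed = 0
    for events in histories.values():
        first_days = [day for day, event in events if event == first]
        if not first_days:
            continue
        n_first += 1
        start = min(first_days)
        for day, event in events:
            within = max_gap_days is None or day - start <= max_gap_days
            if event == then and day > start and within:
                n_followed += 1
                break
    return n_followed / n_first if n_first else 0.0


# Invented client histories, for illustration only.
W, S = "order:WEBMINER", "order:software"
sample_histories = {
    "c1": [(0, W), (5, S)],
    "c2": [(0, W), (14, S)],
    "c3": [(2, W), (17, S)],
    "c4": [(0, W), (30, S)],
    "c5": [(0, W)],
    "c6": [(3, S)],
}
within_15_days = sequential_confidence(sample_histories, W, S, max_gap_days=15)
```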

36
Pattern Discovery (cont.)

Path Analysis
Types of Path/Usage Information
Most frequent paths traversed by users
Entry and exit points
Distribution of user session durations / user
attrition
Examples
60% of clients who accessed
/home/products/file1.html followed the path
/home -> /home/whatsnew -> /home/products ->
/home/products/file1.html.
(Olympics Web site) 30% of clients who accessed
sport-specific pages started from the Sneakpeek
page.
65% of clients left the site after 4 or fewer
references.
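
The kinds of path information listed above can be derived from sessions represented as ordered page lists. The sketch below is one possible way to do it; the function name path_report, the returned dictionary keys, and the sample sessions are all assumptions made for this illustration.

```python
from collections import Counter


def path_report(sessions, max_refs=4):
    """Summarize entry points, exit points, frequent paths, and short visits."""
    sessions = [s for s in sessions if s]  # ignore empty sessions
    if not sessions:
        return {"entry": [], "exit": [], "paths": [], "short_fraction": 0.0}
    short = sum(1 for s in sessions if len(s) <= max_refs)
    return {
        "entry": Counter(s[0] for s in sessions).most_common(),
        "exit": Counter(s[-1] for s in sessions).most_common(),
        "paths": Counter(tuple(s) for s in sessions).most_common(),
        # Share of sessions that ended after at most max_refs references.
        "short_fraction": short / len(sessions),
    }


# Invented sessions, for illustration only.
sample_paths = [
    ["/home", "/home/whatsnew", "/home/products"],
    ["/home", "/home/products"],
    ["/home", "/home/whatsnew", "/home/products"],
    ["/a", "/b", "/c", "/d", "/e"],
]
report = path_report(sample_paths)
```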

37
Data and Transaction Model for Association Rules

Let L be a set of server access log entries. A
log entry l ∈ L has the following components:
. The IP address of the client, denoted l.ip
. The user id of the client, denoted l.uid
. The URL of the page accessed by the client,
denoted l.url
. The time of access, denoted l.time

38
Data and Transaction Model for Association Rules

Definition 1 An association transaction t is a
triple

39
Data and Transaction Model for Association Rules
40
Data and Transaction Model for Association Rules
41
Data and Transaction Model for Sequential Rules

42
Data and Transaction Model for Sequential Rules
43
Application of Association rules
44
Application of Association rules
45
Application of Sequential pattern rules
46
Application of Sequential pattern rules
47
Experimental Results
48
Experimental Results
49
Example: Session Inference with Referrer Log

#  IP           Time      URL  Referrer  Agent
1  www.aol.com  08:30:00  A    -         Mozilla/2.0 AIX 4.1.4
2  www.aol.com  08:30:01  B    E         Mozilla/2.0 AIX 4.1.4
3  www.aol.com  08:30:02  C    B         Mozilla/2.0 AIX 4.1.4
4  www.aol.com  08:30:01  B    -         Mozilla/2.0 Win 95
5  www.aol.com  08:30:03  C    B         Mozilla/2.0 Win 95
6  www.aol.com  08:30:04  F    -         Mozilla/2.0 Win 95
7  www.aol.com  08:30:04  B    A         Mozilla/2.0 AIX 4.1.4
8  www.aol.com  08:30:05  G    B         Mozilla/2.0 AIX 4.1.4

Identified Sessions
S1: A -> B -> G (no referrer), from references 1, 7, 8
S2: E -> B -> C, from references 2, 3
S3: B -> C (no referrer), from references 4, 5
S4: F (no referrer), from reference 6
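
One heuristic that reproduces these four sessions is sketched below. It is an assumption about how such sessions could be inferred, not necessarily the procedure behind the slide: requests are first separated by (IP, agent); a request with no referrer, or with a referrer not yet seen in any session of that group, opens a new session; otherwise the request joins the session that visited the referrer most recently. The names LOG and infer_sessions are hypothetical.

```python
# Records mirroring the example table:
# (reference number, IP, time, URL, referrer or None, agent)
AIX = "Mozilla/2.0 AIX 4.1.4"
WIN = "Mozilla/2.0 Win 95"
LOG = [
    (1, "www.aol.com", "08:30:00", "A", None, AIX),
    (2, "www.aol.com", "08:30:01", "B", "E", AIX),
    (3, "www.aol.com", "08:30:02", "C", "B", AIX),
    (4, "www.aol.com", "08:30:01", "B", None, WIN),
    (5, "www.aol.com", "08:30:03", "C", "B", WIN),
    (6, "www.aol.com", "08:30:04", "F", None, WIN),
    (7, "www.aol.com", "08:30:04", "B", "A", AIX),
    (8, "www.aol.com", "08:30:05", "G", "B", AIX),
]


def infer_sessions(log):
    """Group log records into sessions using IP, agent, and referrer."""
    sessions = []  # kept in order of creation
    for ref_no, ip, _time, url, referrer, agent in log:
        key = (ip, agent)
        best, best_seen = None, -1
        if referrer is not None:
            for session in sessions:
                if session["key"] != key:
                    continue
                # Latest reference in this session that requested the referrer page.
                seen = max((n for n, u in session["hits"] if u == referrer), default=-1)
                if seen > best_seen:
                    best, best_seen = session, seen
        if best is None:
            # No referrer, or referrer unknown to this (IP, agent): start a new session.
            best = {"key": key, "hits": []}
            sessions.append(best)
        best["hits"].append((ref_no, url))
    return sessions


inferred = infer_sessions(LOG)
session_refs = [[n for n, _ in s["hits"]] for s in inferred]
session_pages = [[u for _, u in s["hits"]] for s in inferred]
```

Reference 8 shows why the "most recently visited" rule matters: page B appears in both S1 and S2, but S1 requested it last (reference 7), so G is attached to S1.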
50
Applications for Web-Based Organizations