Our Topic: Web Usage Mining. Presented by: Wenzhen Xing & Kun Gao. With guide of: Dr. Bettina Berendt. For seminar: Web Mining.

1
Our Topic: Web Usage Mining
Presented by: Wenzhen Xing & Kun Gao
With guide of: Dr. Bettina Berendt
For seminar: Web Mining
2
This is a dynamic and fast-changing world!
3
Introduction and Background
  • More and more organizations rely on the Internet
    and the World Wide Web to conduct business.
  • They generate and collect large volumes of data in
    their daily operations.

4
  • These data are generally gathered automatically
    by web servers and collected in server or access
    logs.
  • Mining and analyzing these logs can provide
    valuable information, e.g. for targeting ads to
    specific groups of users.

5
  • Web mining is the application of data mining
    techniques to large web data repositories.

6
The Goals of Web Mining also include the
improvement of web design and structure, and
generation of dynamic recommendations.
(Session 1)
7
Overview
  • Web Mining
  • Web Usage Mining
  • Data Source
  • Three phases of Web Usage Mining
  • Preprocessing
  • Pattern Discovery
  • Pattern Analysis
  • Applications of related software
  • Conclusion

8
Taxonomy of Web Mining
  • Web Mining
      • Web Content Mining
          • Agent-based Approach
              • Intelligent Search Agents
              • Info. Filtering/Categorization
              • Personalized Web Agents
          • Database Approach
              • Multilevel Databases
              • Web Query Systems
      • Web Usage Mining
          • Data Integration
          • Transaction Identification
          • Pattern Discovery Tools
          • Pattern Analysis Tools

9
Knowledge Discovery in Databases
  • selection: DATA -> Target Data
  • preprocessing: Target Data -> Preprocessed Data
  • transformation: Preprocessed Data -> Transformed Data
  • data mining: Transformed Data -> Patterns
  • interpretation: Patterns -> KNOWLEDGE
10
  • Classification of web data
  • Content data: any complete or synthetic
    representation of the resource (the real data),
    such as HTML documents, images, sound files, etc.
  • Structure data: data describing the structure
    and organization of the content through
    internal tags (intra-page) or hyperlinks
    (inter-page).
  • User profile data: demographic information
    derived from registration.
  • Usage data: data that describes the pattern of
    usage of Web pages, such as IP addresses, page
    references, and the date and time of accesses.

11
  • Data Sources
  • Server-level collection: the server stores data
    about the requests made by clients, so the
    data generally concern just one source.
  • Client-level collection: the client
    itself sends information about the user's
    behaviour to a repository. This can be
    implemented with a remote agent (such as
    JavaScript or Java applets) or by modifying the
    source code of an existing browser (such as
    Mosaic or Mozilla) to enhance its data collection
    capabilities.
  • Proxy-level collection: information is stored
    at the proxy side, so the data cover several
    Web sites, but only for users whose Web clients pass
    through the proxy.

12
Data Sources
13
Web Server Log
(Session 2)
14
Web Server Access Logs
  • Typical Data in a Server Access Log

looney.cs.umn.edu han - [09/Aug/1996:09:53:52 -0500] "GET mobasher/courses/cs5106/cs5106l1.html HTTP/1.0" 200
mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291
mega.cs.umn.edu njain - [09/Aug/1996:09:53:53 -0500] "GET /images/backgnds/paper.gif HTTP/1.0" 200 3014
mega.cs.umn.edu njain - [09/Aug/1996:09:54:12 -0500] "GET /cgi-bin/Count.cgi?dfCS home.dat\ddC\ft1 HTTP
mega.cs.umn.edu njain - [09/Aug/1996:09:54:18 -0500] "GET advisor HTTP/1.0" 302
mega.cs.umn.edu njain - [09/Aug/1996:09:54:19 -0500] "GET advisor/ HTTP/1.0" 200 487
looney.cs.umn.edu han - [09/Aug/1996:09:54:28 -0500] "GET mobasher/courses/cs5106/cs5106l2.html HTTP/1.0" 200
. . .
  • Access Log Format
  • IP address, userid, time, method, url, protocol,
    status, size
  • Example: mega.cs.umn.edu njain
    09/Aug/1996:09:54:31 advisor/csci-faq.html
  • Other Server Logs: referrer logs, agent logs

15
  • client IP address or hostname
  • user id ('-' if anonymous)
  • access time
  • HTTP request method (GET, POST, HEAD, ...)
  • path of the resource on the Web server
    (identifying the URL)
  • the protocol used for the transmission
    (HTTP/1.0, HTTP/1.1)
  • the status code returned by the server as
    response (200 for OK, 404 for not found, ...)
  • the number of bytes transmitted
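
The field list above can be turned into a small parser. The following is a minimal sketch, assuming the log uses the Common Log Format with a bracketed timestamp; the regular expression, the group names and the function name `parse_log_line` are illustrative choices, not part of the slides.

```python
import re
from datetime import datetime

# Assumed layout: host userid extra [time] "method url protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<userid>\S+) (?P<extra>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of the access-log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Timestamp such as 09/Aug/1996:09:53:52 -0500
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A size of "-" means no byte count was logged
    entry["size"] = None if entry["size"] == "-" else int(entry["size"])
    return entry


sample = 'mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291'
entry = parse_log_line(sample)
print(entry["host"], entry["userid"], entry["url"], entry["status"], entry["size"])
```

Lines that are truncated or lack a byte count do not match this pattern and return None, so a real preprocessing step would need a more tolerant expression.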
16
Three Phases
17
Three Phases
  • Preprocessing
  • Pattern discovery
  • Pattern analysis

18
Preprocessing
  • Convert the raw usage data into the data
    abstractions needed for pattern discovery.
  • The most difficult task in Web usage mining,
    due to the incompleteness of the available data.

19
Usage Preprocessing
  • A single proxy server may have several users
    accessing a Web site, potentially over the same
    time period.
  • Single IP address / multiple users
  • The "AOL effect"
  • ISP proxy servers
  • Public access machines

20
Usage Preprocessing(cont.)
  • Multiple IP addresses / single user - A user who
    accesses the Web from different machines will
    have a different IP address from session to
    session. This makes tracking repeat visits from
    the same user difficult.

21
Usage Preprocessing(cont.)
  • Multiple IP addresses / single server session - Some
    ISPs or privacy tools randomly assign each
    request from a user to one of several IP
    addresses. In this case, a single server session
    can have multiple IP addresses.
  • Multiple agents / single user - Again, a user who
    uses more than one browser, even on the same
    machine, will appear as multiple users.

22
Usage Preprocessing(cont.)
  • Solutions to the problem
  • Cookies - small pieces of data saved on the
    client machine
  • Advantages: track the same user across
    multiple sessions.
  • Disadvantages: can be declined or
    deleted; privacy concerns.
  • User login - require users to log in with an ID
    and password
  • Advantages: unique ID tied to an
    individual, not to a machine or browser.
  • Disadvantages: not all users are willing to
    register.

23
Usage Preprocessing(cont.)
  • Solutions to the problem (cont.)
  • Embedded session ID
  • Advantages: can't be turned off.
  • Disadvantages: can't track repeat
    visits.
  • Loses the first file access of each session.
  • Client-side tracking (modified browser)
  • Advantages: clean, accurate source of
    usage data.
  • Disadvantages: privacy concerns; can only
    track a small percentage of the user population.

24
Usage Preprocessing(cont.)
  • Other methods
  • Use session time-outs
  • Path completion to infer cached references
  • Example: expanding a session A -> B -> C by an
    access pair (B -> D) results in
    A -> B -> C -> B -> D
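
Both heuristics on this slide can be sketched in a few lines. This is a minimal illustration only: the function names are hypothetical, the 30-minute time-out is an assumed default rather than a value from the slides, and path completion assumes the missing requests were backward steps served from the browser cache.

```python
def split_sessions(requests, timeout_seconds=1800):
    """Split one user's (timestamp_seconds, url) requests, sorted by time, into sessions.

    A new session starts whenever the gap to the previous request exceeds
    timeout_seconds (1800 s is an assumed value).
    """
    sessions = []
    for timestamp, url in requests:
        if sessions and timestamp - sessions[-1][-1][0] <= timeout_seconds:
            sessions[-1].append((timestamp, url))
        else:
            sessions.append([(timestamp, url)])
    return sessions


def complete_path(path, referrer, page):
    """Append page to path, inserting the inferred back-steps to the referrer."""
    completed = list(path)
    if completed and completed[-1] != referrer and referrer in completed:
        # Index of the most recent visit to the referrer
        last_seen = len(completed) - 1 - completed[::-1].index(referrer)
        # Pages revisited (from cache) while going back to the referrer
        completed.extend(completed[last_seen:-1][::-1])
    completed.append(page)
    return completed


print(complete_path(["A", "B", "C"], "B", "D"))
print(split_sessions([(0, "A"), (60, "B"), (4000, "C")]))
```

The first call reproduces the slide's example: the access pair (B -> D) extends A -> B -> C to A -> B -> C -> B -> D.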

25
Content Preprocessing
  • Converting text, images, scripts, and other
    files such as multimedia into forms that are
    useful for the Web usage mining process.
  • This often consists of performing content mining
    such as classification or clustering (also found
    in pattern discovery).

26
Pattern Discovery
  • Pattern discovery draws upon methods and
    algorithms developed from several fields such as
    statistics, data mining, machine learning and
    pattern recognition.

27
Pattern Discovery
  • Statistics
  • Association Rules
  • Clustering
  • Classification
  • Sequential Patterns
  • Path Analysis
  • etc...

28
Pattern Discovery(cont.)
  • Statistics
  • Most common method.
  • This kind of analysis is performed by many tools.
    Its aim is to describe the traffic
    on a Web site, e.g. most visited pages, average
    daily hits, etc.
  • Useful for improving system performance,
    enhancing system security,
    facilitating site modification, etc.

(Session 3 and 5)
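
The descriptive statistics named on this slide (most visited pages, average daily hits) can be computed directly from parsed log records. The sketch below assumes each request has already been reduced to a (date, url) pair; the function name `traffic_summary` and the sample data are hypothetical.

```python
from collections import Counter


def traffic_summary(hits):
    """Return (pages ranked by hit count, average hits per day).

    hits: iterable of (date_string, url) pairs, one per logged request.
    """
    hits = list(hits)
    page_counts = Counter(url for _, url in hits)
    daily_counts = Counter(day for day, _ in hits)
    average_daily_hits = len(hits) / len(daily_counts) if daily_counts else 0.0
    return page_counts.most_common(), average_daily_hits


hits = [
    ("09/Aug/1996", "/"),
    ("09/Aug/1996", "/advisor/"),
    ("10/Aug/1996", "/"),
    ("10/Aug/1996", "/"),
]
ranked_pages, average = traffic_summary(hits)
print(ranked_pages[0], average)
```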
29
Pattern Discovery (cont.)
  • Association rules
  • The main idea is to treat the URLs requested
    by a user in a visit as basket data (items) and to
    discover relationships among them that meet a
    minimum support level.
  • Discover the correlations among references to
    various pages of a Web site in a single server
    session.
  • Useful for restructuring a Web site and as a
    heuristic for pre-fetching documents to reduce
    latency.

30
Association Rules (cont.)
  • Discovers affinities among sets of items across
    transactions
  • X => Y
  • where X, Y are sets of items; each rule is
    qualified by a confidence and a support value
  • Examples
  • 60% of clients who accessed /products/ also
    accessed /products/software/webminer.htm.
  • 30% of clients who accessed /special-offer.html
    placed an online order in /products/software/.
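
To make the two rule measures concrete, here is a minimal sketch using the standard definitions: support is the fraction of all sessions that contain both X and Y, and confidence is the fraction of sessions containing X that also contain Y. The function name `rule_metrics` and the toy sessions are assumptions for illustration; they are not the data behind the percentages on the slide.

```python
def rule_metrics(sessions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    x = set(antecedent)
    y = set(consequent)
    sessions = [set(s) for s in sessions]
    with_x = sum(1 for s in sessions if x <= s)
    with_both = sum(1 for s in sessions if (x | y) <= s)
    support = with_both / len(sessions) if sessions else 0.0
    confidence = with_both / with_x if with_x else 0.0
    return support, confidence


# Ten toy sessions: five visit /products/, three of those also visit webminer.htm
sessions = (
    [["/products/", "/products/software/webminer.htm"]] * 3
    + [["/products/"]] * 2
    + [["/"]] * 5
)
support, confidence = rule_metrics(
    sessions, ["/products/"], ["/products/software/webminer.htm"]
)
print(support, confidence)
```

A full miner such as Apriori would enumerate candidate itemsets above the minimum support instead of scoring one rule at a time; this sketch only scores a rule that is already given.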

31
Pattern Discovery (cont.)
  • Clustering
  • Meaningful clusters of URLs can be created by
    discovering similar characteristics between them
    according to users' behavior.
  • Usage clusters
  • Useful for market segmentation in
    E-commerce or for providing personalized Web
    content to users.
  • Page clusters
  • Useful for Internet search engines and Web
    assistance providers.

32
Pattern Discovery (cont.)
  • Classification
  • Develop a profile of users belonging to a
    particular class or category.
  • Requires extraction and selection of features that
    best describe the properties of a given class or
    category.

33
Pattern Discovery (cont.)
  • Clustering and Classification
  • Clients who often access
    /products/software/webminer.html
    tend to be from educational institutions.
  • Clients who placed an online order for software
    tend to be students in the 20-25 age group and
    live in the United States.
  • 75% of clients who download software from
    /products/software/demos/ visit between 7:00 and
    11:00 pm on weekends.

34
Pattern Discovery (cont.)
  • Sequential Patterns
  • Find inter-session patterns such that the
    presence of a set of items is followed by another
    item in a time-ordered set of sessions or
    episodes.
  • Useful for predicting the future behavior of
    clients.
  • This technique attempts to discover time-ordered
    sequences of URLs followed by past users in
    order to predict future ones (widely used for
    Web advertising).

35
  • Sequential Patterns
  • 30% of clients who visited /products/software/
    had done a search in Yahoo using the keyword
    "software" before their visit.
  • 60% of clients who placed an online order for
    WEBMINER placed another online order for
    software within 15 days.
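
A statement like the second example can be checked against time-ordered client histories. The sketch below is a simple counting illustration, not a full sequential-pattern miner; the function name `followed_within`, the event labels and the toy histories are all hypothetical.

```python
def followed_within(histories, first, second, max_gap):
    """Fraction of clients with event `first` who later have `second` within max_gap.

    histories: dict mapping client id -> list of (time, item) pairs.
    Times and max_gap share one unit (days in the example below).
    """
    with_first = 0
    with_both = 0
    for events in histories.values():
        first_times = [t for t, item in events if item == first]
        if not first_times:
            continue
        with_first += 1
        second_times = [t for t, item in events if item == second]
        # The second event must come strictly after the first, within the window
        if any(0 < t2 - t1 <= max_gap for t1 in first_times for t2 in second_times):
            with_both += 1
    return with_both / with_first if with_first else 0.0


histories = {
    "client1": [(0, "order:WEBMINER"), (10, "order:software")],
    "client2": [(0, "order:WEBMINER"), (30, "order:software")],
    "client3": [(5, "order:software")],
}
print(followed_within(histories, "order:WEBMINER", "order:software", 15))
```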

36
Pattern Discovery (cont.)
  • Path Analysis
  • Types of Path/Usage Information
  • Most frequent paths traversed by users
  • Entry and exit points
  • Distribution of user session durations / user
    attrition
  • Examples
  • 60% of clients who accessed
    /home/products/file1.html followed the path
    /home -> /home/whatsnew -> /home/products ->
    /home/products/file1.html
  • (Olympics Web site) 30% of clients who accessed
    sport-specific pages started from the Sneakpeek
    page.
  • 65% of clients left the site after 4 or fewer
    references.
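
The kinds of path information listed on this slide can be derived from sessions that have already been identified. The following sketch assumes each session is an ordered list of URLs; the function name `path_statistics`, the threshold parameter and the sample sessions are illustrative assumptions.

```python
from collections import Counter


def path_statistics(sessions, short_visit_length=4):
    """Return entry points, exit points, path counts and the share of short visits."""
    sessions = [tuple(s) for s in sessions if s]
    entry_points = Counter(s[0] for s in sessions)
    exit_points = Counter(s[-1] for s in sessions)
    path_counts = Counter(sessions)
    short = sum(1 for s in sessions if len(s) <= short_visit_length)
    short_share = short / len(sessions) if sessions else 0.0
    return entry_points, exit_points, path_counts, short_share


sessions = [
    ["/home", "/home/whatsnew", "/home/products"],
    ["/home", "/home/products"],
    ["/home", "/home/whatsnew", "/home/products"],
]
entries, exits, paths, short_share = path_statistics(sessions)
print(entries.most_common(1), exits.most_common(1), paths.most_common(1), short_share)
```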

37
Data and Transaction Model for Association Rules
  • Let L be a set of server access log entries. A
    log entry l ∈ L has the following components:
  • The IP address of the client, denoted l.ip
  • The user id of the client, denoted l.uid
  • The URL of the page accessed by the client,
    denoted l.url
  • The time of access, denoted l.time
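
The four components map naturally onto a small record type. This is a sketch only: the class name `LogEntry` and the helper `entries_for_client` are hypothetical, and grouping entries by (ip, uid) is an illustrative step, not the formal transaction definition from the following slide.

```python
from typing import List, NamedTuple


class LogEntry(NamedTuple):
    ip: str      # l.ip: IP address (or hostname) of the client
    uid: str     # l.uid: user id of the client
    url: str     # l.url: URL of the page accessed
    time: float  # l.time: time of access, here as seconds


def entries_for_client(log: List[LogEntry], ip: str, uid: str) -> List[LogEntry]:
    """Return the entries of one client (same ip and uid), ordered by access time."""
    return sorted((e for e in log if e.ip == ip and e.uid == uid), key=lambda e: e.time)


log = [
    LogEntry("mega.cs.umn.edu", "njain", "/", 2.0),
    LogEntry("mega.cs.umn.edu", "njain", "advisor/", 1.0),
    LogEntry("looney.cs.umn.edu", "han", "/", 0.5),
]
print([e.url for e in entries_for_client(log, "mega.cs.umn.edu", "njain")])
```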

38
Data and Transaction Model for Association Rules
  • Definition 1 An association transaction t is a
    triple

39
Data and Transaction Model for Association Rules
40
Data and Transaction Model for Association Rules
41
Data and Transaction Model for Sequential Rules

42
Data and Transaction Model for Sequential Rules
43
Application of Association rules
44
Application of Association rules
45
Application of Sequential pattern rules
46
Application of Sequential pattern rules
47
Experimental Results
48
Experimental Results
49
Example: Session Inference with Referrer Log

#  IP           Time      URL  Referrer  Agent
1  www.aol.com  08:30:00  A    (none)    Mozilla/2.0 AIX 4.1.4
2  www.aol.com  08:30:01  B    E         Mozilla/2.0 AIX 4.1.4
3  www.aol.com  08:30:02  C    B         Mozilla/2.0 AIX 4.1.4
4  www.aol.com  08:30:01  B    (none)    Mozilla/2.0 Win 95
5  www.aol.com  08:30:03  C    B         Mozilla/2.0 Win 95
6  www.aol.com  08:30:04  F    (none)    Mozilla/2.0 Win 95
7  www.aol.com  08:30:04  B    A         Mozilla/2.0 AIX 4.1.4
8  www.aol.com  08:30:05  G    B         Mozilla/2.0 AIX 4.1.4

Identified sessions:
  • S1: (none) -> A -> B -> G, from references 1, 7, 8
  • S2: E -> B -> C, from references 2, 3
  • S3: (none) -> B -> C, from references 4, 5
  • S4: (none) -> F, from reference 6
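
One way to arrive at these four sessions is a referrer-and-agent heuristic, sketched below. This is an assumed reconstruction rather than the presenters' algorithm: requests are grouped by (IP, agent), a request joins a session that already contains its referrer (preferring a session whose latest page is the referrer), and otherwise starts a new session. The function name `infer_sessions` and the tuple layout are hypothetical.

```python
def infer_sessions(requests):
    """Group requests into sessions using IP, agent and referrer.

    requests: list of (ref_no, ip, url, referrer, agent) in log order;
    referrer is None when the log records no referrer.
    """
    sessions = []  # each session: {"key": (ip, agent), "path": [...], "refs": [...]}
    for ref_no, ip, url, referrer, agent in requests:
        candidates = [
            s for s in sessions
            if s["key"] == (ip, agent) and referrer in s["path"]
        ]
        if referrer is None or not candidates:
            # No usable referrer: open a new session (starting at the referrer, if any)
            path = [url] if referrer is None else [referrer, url]
            sessions.append({"key": (ip, agent), "path": path, "refs": [ref_no]})
            continue
        # Prefer a session currently on the referrer page, else the latest match
        on_referrer = [s for s in candidates if s["path"][-1] == referrer]
        target = (on_referrer or candidates)[-1]
        target["path"].append(url)
        target["refs"].append(ref_no)
    return sessions


AIX, WIN = "Mozilla/2.0 AIX 4.1.4", "Mozilla/2.0 Win 95"
requests = [
    (1, "www.aol.com", "A", None, AIX),
    (2, "www.aol.com", "B", "E", AIX),
    (3, "www.aol.com", "C", "B", AIX),
    (4, "www.aol.com", "B", None, WIN),
    (5, "www.aol.com", "C", "B", WIN),
    (6, "www.aol.com", "F", None, WIN),
    (7, "www.aol.com", "B", "A", AIX),
    (8, "www.aol.com", "G", "B", AIX),
]
for session in infer_sessions(requests):
    print(" -> ".join(session["path"]), session["refs"])
```

On the eight requests of the example this groups references 1, 7, 8; 2, 3; 4, 5; and 6, matching S1 to S4.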
50
Applications for Web-Based Organizations
  • Electronic Commerce
  • determine lifetime value of clients
  • design cross marketing strategies across products
  • evaluate promotional campaigns
  • target electronic ads and coupons at user groups
    based on their access patterns
  • predict user behavior based on previously learned
    rules and users' profiles
  • present dynamic information to users based on
    their interests and profiles

51
Other Applications
  • Effective and Efficient Web Presence
  • determine the best way to structure the Web site
  • identify weak links for elimination or
    enhancement
  • A site-specific web design agent
  • prefetch files that are most likely to be
    accessed
  • Intra-Organizational Applications
  • enhance workgroup management and communication
  • evaluate Intranet effectiveness and identify
    structural needs and requirements