Web Usage Mining - PowerPoint PPT Presentation

Web Usage Mining

Provided by: christop1

1
Web Usage Mining
  • Chris Yang

2
Three Phases of Web Usage Mining
  • Discover usage patterns from Web data to
    understand and better serve the needs of
    Web-based applications (Srivastava et al., 2000)
  • Three phases
  • Preprocessing
  • Pattern discovery
  • Pattern analysis

4
Motivation of Web Usage Mining
  • Bring vendor and end customer in electronic
    commerce closer
  • Mass customization
  • Vendor may personalize his product message for
    individual customers at a massive scale

5
Data Sources
  • Server
  • Web server log explicitly records the browsing
    behavior of site visitors and reflects the access
    of a Web site by multiple users
  • Formats
  • Common log
  • Extended log
  • Web log may not be completely reliable
  • Cached files are served from the client's cache
    and are never requested from the server
  • Information passed through the POST method is not
    available in a server log
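
As a rough illustration of what a server log records, the sketch below parses one line in the common log format. The sample line, the function name parse_clf_line, and the field names are assumptions made for this example; real logs may carry extra fields (as the extended format does).

```python
import re
from datetime import datetime

# Common log format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_clf_line(line):
    """Parse one common-log-format line into a dict, or return None."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the size field means no body was returned.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    # The request field holds the method, the URL and the protocol version.
    parts = entry["request"].split()
    entry["method"], entry["url"], entry["protocol"] = (parts + [None] * 3)[:3]
    return entry

# Hypothetical sample line (documentation IP address, made-up page).
sample = '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
entry = parse_clf_line(sample)
print(entry["host"], entry["method"], entry["url"], entry["status"], entry["size"])
```

Note that the log line carries the URL of a POST request but not the data sent in its body, which is why POST information is missing from the log.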

6
HTTP
  • The Web's RPC on top of TCP/IP
  • It is stateless: each request is handled
    independently, and in HTTP/1.0 a separate
    connection is made for every request
  • Simple to implement, yet incurs overhead
  • Each HTTP client/server interaction consists of
    a single request/reply interchange
  • HTTP request
  • HTTP response

7
  • HTTP request message consists of
  • request line
  • method or command to apply to a server resource
  • e.g. GET, POST
  • URL (without protocol and server domain name)
  • the protocol version used by the client, e.g.
    HTTP/1.0
  • request header fields
  • Pass additional information about the request and
    the client itself to the server - much like RPC
    parameters
  • Each header field consists of a name, followed by
    a colon (:) and the field value
  • the entity body (optional)
  • Clients use it to pass bulk information to the
    server (CGI)
  • Examples of HTTP methods
  • GET - retrieve the specified URL
  • POST - send this data to the specified URL
  • Examples of HTTP header fields
  • Accept - lists acceptable MIME type/subtype
    contents
  • User-Agent - provides client browser information

Note: crlf = carriage-return/line-feed
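
The request structure described above (request line, header fields, optional entity body) can be sketched as a small parser. The function name parse_http_request, the URL /cgi-bin/order, and the header values are hypothetical; this is a minimal sketch, not a complete HTTP implementation.

```python
def parse_http_request(raw):
    """Split a raw HTTP request into request line, header fields and body."""
    # A blank line (crlf crlf) separates the headers from the entity body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Request line: method, URL (without protocol and domain), version.
    method, url, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Each header field is a name, a colon, and the field value.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return {"method": method, "url": url, "version": version,
            "headers": headers, "body": body}

# Hypothetical request; every value here is made up for illustration.
raw_request = (
    "POST /cgi-bin/order HTTP/1.0\r\n"
    "Accept: text/html\r\n"
    "User-Agent: ExampleBrowser/1.0\r\n"
    "Content-Length: 9\r\n"
    "\r\n"
    "item=book"
)
request = parse_http_request(raw_request)
print(request["method"], request["url"], request["headers"]["User-Agent"])
```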
8
  • HTTP response message
  • response header line
  • HTTP version, the status of the response, and an
    explanation of the returned status
  • response header fields
  • Information that describes the server's
    attributes and the returned HTML document to
    client
  • entity body
  • Contains an HTML document that a client has
    requested
  • Each HTML document needs a separate request
    message
  • stateless
  • The result code 200 indicates that the request is
    successful.

9
Data Source - Server
  • Web server log in extended log format

10
Data Source - Server
  • Packet sniffing
  • Monitor network traffic coming to a Web server
  • Extract usage data directly from TCP/IP packets
  • Cookies
  • Tokens generated by the Web server for individual
    client browsers to automatically track the site
    visitor
  • HTTP is stateless, which makes tracking
    individual users difficult
  • Cookies rely on implicit user cooperation
  • Query data
  • CGI scripts
  • URI for CGI programs may contain additional
    parameter values to be passed to CGI applications
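
To illustrate query data, the following sketch pulls the parameter values out of a URI using Python's standard urllib.parse module. The URI and the helper name extract_query_data are made up for the example.

```python
from urllib.parse import urlsplit, parse_qs

def extract_query_data(uri):
    """Return the script path and the parameters passed to it in the URI."""
    parts = urlsplit(uri)
    # parse_qs maps each parameter name to a list of its values.
    return parts.path, parse_qs(parts.query)

# Hypothetical CGI request URI.
path, params = extract_query_data("/cgi-bin/search?category=music&q=jazz&page=2")
print(path, params)
```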

11
Data Source - Client
  • Client
  • Remote agent (e.g. JavaScript or Java applets)
  • Modifying the source code of an existing browser
    to enhance data collection capabilities
  • Difficulty - Requires client cooperation, either
    to enable JavaScript and Java applets or to
    voluntarily use the modified browser

12
Data Source - Proxy
  • Proxy
  • Caching between client browsers and Web servers
  • Proxy traces may reveal the actual HTTP request
    from multiple clients to multiple Web servers
  • It helps characterize the browsing behavior of a
    group of anonymous users sharing a common proxy
    server

13
Data Abstractions
  • Data from server, client and proxy help us
    construct data abstractions
  • Users, server sessions, episodes, click-streams,
    and page views
  • The W3C Web Characterization Activity (WCA) has
    drafted definitions of Web terms relevant to Web
    usage (http://www.w3.org/WCA)
  • User: a single individual who is accessing files
    from one or more Web servers through a browser
  • Difficulty in identifying a user: a user may
    access the Web through different machines or use
    more than one agent on a single machine
  • Page view: every file that contributes to the
    display on a user's browser at one time
  • Includes several files such as frames, graphics,
    and scripts
  • When a user downloads a Web page by clicking
    anchor text or submitting a URL, he/she is not
    aware of how many frames, graphics, images, or
    scripts he/she is receiving
  • Click-stream: a sequential series of page view
    requests
  • The server may not have all the information
    needed to reconstruct the click-stream
  • Page views served from a client or proxy-level
    cache are not visible to the server
  • User session: the click-stream of page views for
    a single user across the entire Web
  • In practice, only the portion of a user session
    that accesses a particular site can be
    identified
  • Server session: the set of page views in a user
    session for a particular Web site
  • Episode: any semantically meaningful subset of a
    user or server session

14
Phase 1: Preprocessing
  • Usage Preprocessing
  • Due to the incompleteness of available data,
    usage preprocessing is a difficult task
  • Typical problems
  • Unless client side tracking is used, only IP
    address, agent, and server-side click stream are
    available
  • Single IP address / Multiple server sessions
  • Internet service providers (ISPs) have a pool of
    proxy servers
  • A proxy server may have several users accessing a
    Web site, potentially over the same time period
  • Multiple IP addresses / Single server session
  • Some ISPs or privacy tools randomly assign each
    request from a user to one of several IP
    addresses
  • Multiple IP addresses / Single user
  • A user accesses the Web from different machines
    (a different IP address from session to session)
  • Multiple agents / Single user
  • A user who uses more than one browser appears as
    multiple users

15
Usage Preprocessing
  • Segmenting click-stream into sessions
  • It is difficult to know when a user leaves a Web
    site
  • A thirty-minute timeout is often used (Catledge
    and Pitkow, 1995)
  • In some cases, a session ID is embedded in each
    URI and the session is defined by the content
    server
  • Content from user action
  • Content servers maintain state variables for each
    active session, so the information needed to
    determine the content served for a user request
    is not always available
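
The thirty-minute timeout heuristic can be sketched as follows. This is a minimal sketch under stated assumptions: requests are (user_key, timestamp, url) tuples, the user key is taken to be the (IP address, agent) pair, and all sample data is invented.

```python
from datetime import datetime, timedelta

def split_into_sessions(requests, timeout=timedelta(minutes=30)):
    """Group (user_key, timestamp, url) requests into server sessions."""
    sessions = []    # list of (user_key, [urls]) in order of creation
    last_seen = {}   # user_key -> (time of last request, open url list)
    for user_key, timestamp, url in sorted(requests, key=lambda r: (r[0], r[1])):
        previous = last_seen.get(user_key)
        # Start a new session for a new user or after a gap above the timeout.
        if previous is None or timestamp - previous[0] > timeout:
            urls = []
            sessions.append((user_key, urls))
        else:
            urls = previous[1]
        urls.append(url)
        last_seen[user_key] = (timestamp, urls)
    return sessions

# Hypothetical requests: one IP address, two agents.
t0 = datetime(2000, 1, 1, 10, 0)
agent_a = ("192.0.2.1", "AgentA")
agent_b = ("192.0.2.1", "AgentB")
requests = [
    (agent_a, t0, "/A"),
    (agent_a, t0 + timedelta(minutes=10), "/B"),
    (agent_a, t0 + timedelta(minutes=55), "/C"),  # 45-minute gap: new session
    (agent_b, t0 + timedelta(minutes=5), "/A"),
]
sessions = split_into_sessions(requests)
for user_key, urls in sessions:
    print(user_key, urls)
```

Keying on the (IP address, agent) pair separates two browsers behind one address, but it also makes one person using two browsers look like two users.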

16
  • Using referrer and agent information, 4 sessions
    are determined

17
Content Preprocessing and Structure Preprocessing
  • Content Preprocessing
  • Converting the text, images, scripts, and other
    multimedia files into forms that are useful for
    Web usage mining
  • Classification
  • By content
  • By intended use (Cooley et al., 1999; Pirolli et
    al., 1996)
  • Convey information, gather information from the
    user, allow navigation, or a combination
  • Structure Preprocessing
  • Hyperlinks between page views

18
Phase 2: Pattern Discovery
  • Statistical Analysis
  • Perform descriptive statistical analysis (such as
    mean, median, frequency, etc.) on page views,
    viewing time, and length of a navigational path
    from the session file
  • Web traffic analysis tools produce periodic
    reports
  • Most frequently accessed pages
  • Average view time of a page
  • Average length of a path through a site
  • Useful for improving the system performance,
    enhancing the security of the system,
    facilitating the site modification task, and
    providing support for marketing decisions
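
The three report items listed above can be computed from a session file in a few lines. The sketch assumes each session is a list of (page, viewing time in seconds) pairs; the pages and times are invented.

```python
from collections import Counter
from statistics import mean

# Hypothetical session file: each session is a list of (page, seconds viewed).
sessions = [
    [("/home", 20), ("/products", 45), ("/cart", 30)],
    [("/home", 10), ("/products", 60)],
    [("/products", 30)],
]

# Most frequently accessed pages.
page_counts = Counter(page for session in sessions for page, _ in session)

# Average view time of each page.
view_times = {}
for session in sessions:
    for page, seconds in session:
        view_times.setdefault(page, []).append(seconds)
avg_view_time = {page: mean(times) for page, times in view_times.items()}

# Average length of a path through the site.
avg_path_length = mean(len(session) for session in sessions)

print(page_counts.most_common(1), avg_view_time, avg_path_length)
```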

19
  • Association Rules
  • Relate pages that are most often referenced
    together in a single server session
  • Sets of pages that are accessed together with a
    support value exceeding some specified threshold
  • These pages may not be directly connected by
    hyperlinks
  • Useful for Web designers to restructure their Web
    sites
  • These rules serve as a heuristic for prefetching
    documents in order to reduce user-perceived
    latency when loading a page from a remote site
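
A simple way to see support and confidence at work is to count page pairs per server session. The sketch below handles pairs only (not larger page sets), and the sessions, page names and the 0.5 threshold are assumptions for illustration.

```python
from itertools import combinations
from collections import Counter

def frequent_page_pairs(sessions, min_support):
    """Return rules (a, b, support, confidence) for pairs above min_support."""
    total = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        pages = sorted(set(session))   # each page counts once per session
        page_counts.update(pages)
        pair_counts.update(combinations(pages, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / total
        if support >= min_support:
            # Confidence of a -> b is the share of a-sessions that also have b.
            rules.append((a, b, support, count / page_counts[a]))
            rules.append((b, a, support, count / page_counts[b]))
    return rules

# Hypothetical server sessions.
sessions = [
    ["/home", "/music", "/cart"],
    ["/home", "/music"],
    ["/home", "/books"],
    ["/music", "/cart"],
]
rules = frequent_page_pairs(sessions, min_support=0.5)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

A rule such as /cart -> /music with high confidence is the kind of hint a prefetcher could use, whether or not the two pages are linked.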

20
  • Clustering
  • Group together a set of items having similar
    characteristics
  • Usage clusters
  • Establish groups of users exhibiting similar
    browsing patterns
  • Useful for inferring user demographics in order
    to perform market segmentation
  • Page clusters
  • Discover groups of pages that have related
    content
  • Useful for search engines and Web assistance
    providers

21
  • Classification
  • Mapping a data item into one of several
    predefined classes
  • Develop a profile of users belonging to a
    particular class or category
  • Requires feature extraction and selection that
    best describe the properties of a given class or
    category
  • Techniques
  • Decision tree classifiers, naïve Bayesian
    classifier, k-nearest neighbor classifiers,
    support vector machines, etc.
  • E.g.
  • 30% of users who place online orders in
    /Product/Music are in the 19-25 age group and
    live on the West coast

22
  • Sequential Pattern
  • Find inter-session patterns
  • The presence of a set of items is followed by
    another item in a time-ordered set of sessions or
    episodes
  • Useful for predicting future patterns in order to
    place advertisements for certain user groups
  • Temporal analysis
  • Trend analysis, change point detection, or
    similarity analysis

23
  • Dependency Modeling
  • Develop a model capable of representing
    significant dependencies among the various
    variables in the Web domain
  • E.g.
  • A model representing the different stages a
    visitor undergoes while shopping in an online
    store based on the action chosen (from casual
    visitor to a serious potential buyer)
  • Techniques
  • Hidden Markov models, Bayesian belief networks
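
As a much simpler stand-in for the techniques named above, the sketch below estimates a first-order Markov chain (not a hidden Markov model) over page transitions. The stage names and sessions are invented; the point is only to show how dependencies between successive actions can be estimated from sessions.

```python
from collections import defaultdict, Counter

def transition_probabilities(sessions):
    """Estimate P(next page | current page) from observed sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current_page, next_page in zip(session, session[1:]):
            counts[current_page][next_page] += 1
    return {
        page: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for page, followers in counts.items()
    }

# Hypothetical shopping sessions, from casual visit to checkout.
sessions = [
    ["home", "product", "cart", "checkout"],
    ["home", "product"],
    ["home", "product", "cart"],
    ["home", "search"],
]
model = transition_probabilities(sessions)
print(model["home"], model["product"])
```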

24
Phase 3: Pattern Analysis
  • Filter out uninteresting rules or patterns from
    the set found in the pattern discovery phase

25
Major Application Areas for Web Usage Mining
(Srivastava et al., 2000)
26
Architecture of the WebSIFT system (Cooley et
al., 1999)
27
WUM: Web Usage Miner - Navigation behavior in Web
sites (Berendt and Spiliopoulou, 2000)
  • A Web site is a network of structurally or
    semantically interrelated nodes (built in a way
    that reflects the designer's intuition).
  • Quality of a Web site
  • The conformance of the Web site's structure to
    the intuition of each group of visitors accessing
    the site.
  • Intuition of visitors is indirectly reflected in
    their navigation behavior (represented in the
    browsing pattern)
  • Measure of the quality of Web site
  • Quality of service (e.g. response time)
  • Quality of navigation
  • Accessibility
  • Information utility
  • Ease of use
  • Attractiveness of the presentation metaphor

28
Sequence Mining
  • Sequence mining supports the discovery of
    frequent paths composed of not necessarily
    adjacent pages
  • Given a collection of transactions ordered in
    time (each transaction contains a set of items),
    discover sequences of maximal length with support
    above a given threshold
  • A sequence is an ordered list of elements, an
    element being a set of items appearing together
    in a transaction
  • Elements need not be adjacent in time but their
    ordering in a sequence must not violate the time
    ordering of the support transactions
  • Example
  • Consider a Web site with pages W, A, B, C, D, E
    and a link from W to D
  • WABC (1000 times), WDBC (100 times), WABDEC (400
    times)
  • Frequency threshold: 25%
  • WD appears 500 (400 + 100) times (33%), which is
    above the threshold
  • In the above example, the link from W to D is
    used in only 1 out of 5 of those cases.
    Therefore, sequence mining alone does not reveal
    the usefulness of a hyperlink.
  • In WUM, a navigation pattern is a directed
    acyclic graph composed of a group of sequences
    that conform to a template
  • The purpose is to determine which links' usage is
    responsible for the frequency of sequences
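
The arithmetic in the W/D example can be checked with a short script. It counts how often W is followed by D anywhere later in a path (what sequence mining sees) and how often D follows W directly (actual use of the link). The helper name is hypothetical; the path counts are the ones from the example.

```python
def contains_subsequence(path, pattern):
    """True if the pages of pattern occur in path in order, gaps allowed."""
    remaining = iter(path)
    # "page in remaining" consumes the iterator up to the first hit,
    # so later pages must be found after earlier ones.
    return all(page in remaining for page in pattern)

# Path counts from the example: WABC, WDBC and WABDEC.
paths = {"WABC": 1000, "WDBC": 100, "WABDEC": 400}
total = sum(paths.values())

# W ... D with any pages in between.
wd_support = sum(n for path, n in paths.items() if contains_subsequence(path, "WD"))
# W immediately followed by D, i.e. the hyperlink was actually used.
wd_direct = sum(n for path, n in paths.items() if "WD" in path)

print(wd_support, round(wd_support / total, 2))  # support of the sequence W, D
print(wd_direct, wd_direct / wd_support)         # share that used the link
```

The sequence W, D clears a one-quarter threshold with a third of all paths, yet only a fifth of those occurrences come from the W to D link itself.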

29
WUM: Navigation Sequences and Navigation
Patterns
  • A session is a directed list of page accesses
    performed by a user during his/her visit in a
    site
  • A navigation pattern is a structure that
  • Emphasizes the common parts among the sessions
  • Does not purge the dissimilar parts
  • Annotates both common and non-common parts with
    quantitative information
  • P is a set of Web pages in the site
  • If the site is dynamic in nature, P is the set of
    all pages that can be generated
  • D is a dataset of sessions
  • A session is a directed list of elements from P
  • A sequence of length n is a vector s whose n
    elements are drawn from P × N (N is the set of
    positive integers)
  • U = P × N
  • Example
  • P = {a, b, c, d, e, f, g, h}
  • ab, ac, abcde, bcbf, abdfhe are sessions
    appearing in D

30
Generalized sequences
  • wildcard [low; high]: matched by any sequence
    of elements that has length at least low and at
    most high (low ≥ 0, high ≥ low)
  • wildcard *: its range is not of interest
  • A generalized sequence g is a vector g1 · g2 · …
    · gn
  • The number of non-wildcard elements in g is the
    length of g, length(g)
  • Example
  • (a,1) · (b,1) [2;4] (e,1) matches Sessions 3
    and 5
  • The group of sequences that match g constitutes
    the navigation pattern of g, navp(g)
  • The hits of g, hits(g), is the number of
    sequences that are matched by g
  • confidence(gi, gj, g) = hits(g1 · … · gi-1 · gi) /
    hits(g1 · … · gj)
  • g = (a,1) · (b,1) [2;4] (e,1)
  • hits(g) = 30 + 10 = 40
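
The matching example can be sketched in code. The assumptions are stated plainly: each session page is paired with its occurrence number, elements with no wildcard between them must be adjacent, the wildcard between (b,1) and (e,1) spans 2 to 4 elements, and a match may start anywhere in the session. The tuple encoding and function names are invented for this sketch and are not WUM's own representation.

```python
def annotate(session):
    """Pair each page with its occurrence number: 'bcbf' -> b1 c1 b2 f1."""
    seen = {}
    elements = []
    for page in session:
        seen[page] = seen.get(page, 0) + 1
        elements.append((page, seen[page]))
    return elements

def matches_from(elements, pos, pattern):
    """True if pattern matches the session elements starting at index pos."""
    if not pattern:
        return True
    head, rest = pattern[0], pattern[1:]
    if head[0] == "wildcard":
        _, low, high = head
        # Skip between low and high elements, then match the remainder.
        return any(matches_from(elements, pos + skip, rest)
                   for skip in range(low, high + 1)
                   if pos + skip <= len(elements))
    _, page, occurrence = head
    return (pos < len(elements)
            and elements[pos] == (page, occurrence)
            and matches_from(elements, pos + 1, rest))

def matches(session, pattern):
    elements = annotate(session)
    return any(matches_from(elements, start, pattern)
               for start in range(len(elements)))

# g: (a,1), (b,1), a wildcard of 2 to 4 elements, then (e,1).
g = [("element", "a", 1), ("element", "b", 1),
     ("wildcard", 2, 4), ("element", "e", 1)]
sessions = ["ab", "ac", "abcde", "bcbf", "abdfhe"]
matched = [i for i, s in enumerate(sessions, start=1) if matches(s, g)]
print(matched)
```

With the five sessions listed for D, only the third and fifth match, in line with the example.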

31
Aggregate tree and log
  • navp(g) is modeled as a tree structure (aggregate
    tree)
  • Aggregate log

32
Discover navigation pattern
  • A template is a vector composed of variables
    ranging over the domain U and of wildcards
  • A mining query is a template declaration
    accompanied by a conjunction of constraints on
    the permissible values of the template variables
  • Example
  • NODE AS x y z
  • TEMPLATE x · y [2;4] z AS t
  • WHERE x.support ≥ 85
  • AND (y.support / x.support) ≥ 0.8