Web Usage Mining for Internet Recommendation

1
Web Usage Mining for Internet Recommendation
Tingshao Zhu tszhu_at_cs.ualberta.ca
2
Outline
  • Motivation
  • Survey on Web Mining
  • My Research
3
Motivation
  • More information on the WWW means too much time
    spent surfing.
  • The WWW evolves rapidly, yet some patterns
    remain unchanged.
  • Zipf's law, e.g. (file size vs. number of
    requests; session length vs. number of sessions)
  • Can we explore such patterns and provide better
    services?

4
Research Overview
  • Mining User Patterns
  • Predict Destination Pages for users: group
    users into communities and extract patterns for
    each community.
  • Recommendation Generation
  • Generate recommendations (Destination Pages)
    based on the observed session.
  • Validation
  • Apply in the real world with real people, and
    get feedback for evaluation.

5
Outline
  • Motivation
  • Survey on Web Mining
  • My Research
6
Web Mining
  • Extraction of useful patterns from WWW resources
  • Employs techniques from data mining, machine
    learning, information retrieval, etc.

7
Web Content Mining
  • An automatic process that extracts patterns from
    on-line information such as HTML files, images,
    or e-mails.
  • Text Classification: given labeled training
    examples, train a classifier for on-line
    documents, e.g. assign a new web document to a
    category in the Yahoo hierarchy.
  • E-mail Threading: identify the thread of each
    incoming e-mail, and group all e-mails of the
    same thread together.
  • Text Clustering: especially clustering web pages.
    Macskassy et al. have reported that even though
    cluster analysis can be carried out on web
    pages, little success is obtained.

8
Web Structure Mining
Analysis of the link structure of the web. One of
its purposes is to identify preferable documents,
e.g., PageRank (Google), HITS (Hub/Authority). The
intuition is that a hyperlink from document A to
document B implies that the author of document A
thinks document B contains worthwhile information.
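
The link intuition above can be sketched as a simplified PageRank-style power iteration. This is only an illustration and not Google's actual implementation: the damping factor of 0.85, the iteration count, the toy graph, and the function name `pagerank` are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its rank evenly along its out-links.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # A page without out-links spreads its rank over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: A, C and D all link to B; B links back to A.
toy_links = {"A": ["B"], "C": ["B"], "D": ["B"], "B": ["A"]}
ranks = pagerank(toy_links)
best_page = max(ranks, key=ranks.get)
```

In this toy graph, page B collects links from three authors and ends up with the highest score, which matches the "a link is a vote" intuition.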
9
Web Usage Mining
Application of data mining techniques to discover
usage patterns from Web data, in order to
understand and better serve the needs of
Web-based applications.
10
[Figure: the web usage mining process]
  • Web Log Files
  • Preprocessing: process log data to be suitable
    for learning
  • Pattern Discovery: Clustering, Classification,
    Association Rule, Sequential Pattern
  • Evaluation
  • Application: Dynamic Links, Adaptive web site,
    web site evaluation
11
Preprocessing
Deal with the original log entries, which are noisy
and not directly usable for pattern discovery, and
process them to fit the learning algorithms.
12
Preprocessing Data Filtering
  • Status Code
  • 1xx Informational
  • 2xx Success
  • 3xx Redirection
  • 4xx Client Error
  • 5xx Server Error
  • Automatic requests
  • Embedded image files, frame sets, etc.
  • Robots
  • Frequently check /robots.txt, 8 requests for one
    month of CS.
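
The filtering steps above could be sketched roughly as follows. This assumes the log is in Common Log Format, that only 2xx responses are kept, and that any host that requested /robots.txt is treated as a robot. These choices, the image suffix list, and the names `parse` and `filter_entries` are assumptions, not details given in the slides.

```python
import re

# Assumed Common Log Format: host ident user [time] "METHOD url proto" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")  # illustrative list


def parse(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None


def filter_entries(lines):
    entries = [e for e in map(parse, lines) if e is not None]
    # Hosts that fetched /robots.txt are assumed to be robots.
    robots = {e["host"] for e in entries if e["url"] == "/robots.txt"}
    kept = []
    for e in entries:
        if not e["status"].startswith("2"):
            continue  # drop everything except 2xx Success
        if e["url"].lower().endswith(IMAGE_SUFFIXES):
            continue  # drop automatic requests for embedded images
        if e["host"] in robots:
            continue  # drop all requests from suspected robots
        kept.append(e)
    return kept


# Made-up sample lines for illustration only.
sample_log = [
    '10.0.0.1 - - [01/Jan/2002:10:00:00 -0700] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [01/Jan/2002:10:00:01 -0700] "GET /logo.gif HTTP/1.0" 200 512',
    '10.0.0.1 - - [01/Jan/2002:10:00:05 -0700] "GET /missing.html HTTP/1.0" 404 -',
    '10.0.0.9 - - [01/Jan/2002:10:01:00 -0700] "GET /robots.txt HTTP/1.0" 200 68',
    '10.0.0.9 - - [01/Jan/2002:10:01:02 -0700] "GET /index.html HTTP/1.0" 200 1043',
]
clean_entries = filter_entries(sample_log)
```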

13
Preprocessing User Identification
  • User identification is very important for session
    identification, especially for personalization.
  • Problems
  • Dynamically allocated IP addresses.
  • Caches and proxy servers on the web.
  • Proxies bring more trouble.
  • Possible Solutions
  • Cookies
  • Some heuristics: compare the OS and browser of
    users, use the web site topology, etc.
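
One of the heuristics above, comparing the OS and browser of users, might be sketched by treating each distinct (IP address, user agent) pair as one user. This is a rough approximation under that assumption; the input tuples and the name `identify_users` are hypothetical.

```python
from collections import defaultdict


def identify_users(requests):
    """Group requests by (ip, user_agent); each pair is assumed to be one user."""
    users = defaultdict(list)
    for ip, agent, url in requests:
        users[(ip, agent)].append(url)
    return dict(users)


# Made-up requests: two browsers behind the same proxy IP address.
sample_requests = [
    ("10.0.0.1", "Mozilla/4.0 (Windows)", "/index.html"),
    ("10.0.0.1", "Mozilla/4.0 (Linux)", "/index.html"),
    ("10.0.0.1", "Mozilla/4.0 (Windows)", "/research.html"),
]
users = identify_users(sample_requests)
```

Two people with identical OS and browser behind the same proxy would still be merged, which is why the slides list cookies and site topology as further options.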

14
Preprocessing Session Identification
  • A session is a group of user activities that are
    related to each other not only through an
    evolving information need but also through close
    proximity in time.
  • Time Out: no two consecutive requests are
    separated by an interval longer than a predefined
    threshold (mostly 25.5 or 30 minutes).
  • Reference Length: a session ends with a content
    page. A content page is a page on which the user
    spends more time than a threshold.
  • Maximal Forward Reference: a session runs up to
    the page before a backward reference is made.
  • Time Window: the total time spent on the session
    is less than a threshold.
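
Two of the heuristics above, Time Out and Maximal Forward Reference, can be sketched as follows. The input format (time-sorted requests of one user), the 30 minute default, and both function names are assumptions made for illustration.

```python
def split_by_timeout(requests, timeout=30 * 60):
    """Split one user's requests into sessions by the time-out heuristic.

    requests is a list of (timestamp_in_seconds, page) sorted by time.
    """
    sessions = []
    current = []
    last_time = None
    for ts, page in requests:
        if last_time is not None and ts - last_time > timeout:
            sessions.append(current)  # gap too long: close the session
            current = []
        current.append(page)
        last_time = ts
    if current:
        sessions.append(current)
    return sessions


def maximal_forward_references(pages):
    """Cut a page sequence each time a backward reference follows forward moves."""
    results = []
    path = []
    moving_forward = False
    for page in pages:
        if page in path:
            # Backward reference: emit the path if we were moving forward.
            if moving_forward:
                results.append(list(path))
            path = path[: path.index(page) + 1]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    if moving_forward and path:
        results.append(list(path))
    return results


# Made-up data: the third request arrives more than 30 minutes after the second.
timed = [(0, "a"), (600, "b"), (600 + 1801, "c")]
timeout_sessions = split_by_timeout(timed)

# Made-up traversal with several backward references.
trail = list("ABCDCBEGHGWAOUOV")
forward_refs = ["".join(p) for p in maximal_forward_references(trail)]
```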

15
Pattern Discovery -- Clustering
  • Find clusters of users, pages, or sessions from
    web log files.
  • User A
  • S1: 0, 1, 3, 5
  • S2: 2, 3, 5, 6
  • User B
  • S3: 1, 3, 5
  • User C
  • S4: 3, 5
  • User Clustering: cluster users based on their
    behaviors.
  • Page Clustering: cluster pages according to
    users' accesses to them.
  • Session Clustering: find interesting session
    clusters. Each session cluster may correspond to
    one interesting topic within the web site.
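
Any of these clusterings needs a similarity measure. As a minimal sketch, sessions can be compared as sets of pages with the Jaccard coefficient. The slides do not name a measure, so this choice and the helper names are assumptions. The sessions are the S1 to S4 example above.

```python
def jaccard(a, b):
    """Jaccard similarity of two sessions viewed as sets of page ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(name, sessions):
    """Return the name of the session closest to the named one."""
    others = [other for other in sessions if other != name]
    return max(others, key=lambda other: jaccard(sessions[name], sessions[other]))


# Sessions S1..S4 from the slide.
sessions = {
    "S1": [0, 1, 3, 5],
    "S2": [2, 3, 5, 6],
    "S3": [1, 3, 5],
    "S4": [3, 5],
}
sim_s1_s3 = jaccard(sessions["S1"], sessions["S3"])
closest_to_s4 = most_similar("S4", sessions)
```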

16
Pattern Discovery -- Classification
  • Page interest estimator based on some
    heuristics, such as
  • How often does the user visit the page?
  • Is this page included in the user's bookmarks?
  • How long has the user spent on this page?
  • Has the user visited the web page recently?
  • How many links in the web page are interesting to
    the user?
  • etc.
  • Classify a page as interesting or not: take all
    the accessed pages as interesting pages and
    those never visited as non-interesting, extract
    features from each page, and train a classifier.

17
Pattern Discovery -- Association Rule
  • The problem of discovering all association rules
    can be decomposed into two subproblems:
  • Find all sets of items (itemsets) that have
    transaction support above the minimum support
    (large itemsets). The support for an itemset is
    the number of transactions that contain the
    itemset (independent of order).
  • Use the large itemsets to generate the desired
    rules. For every large itemset, find all
    nonempty subsets. For every such subset, output
    a rule if its confidence is higher than
    minConfidence.
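
The two subproblems can be sketched with a brute-force search, which is practical only for tiny inputs and is not an optimized miner such as Apriori. The transactions reuse sessions S1 to S4 from the clustering slide purely as an illustration. The thresholds and function names are assumptions, and a rule is kept here when its confidence is at least the threshold.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def large_itemsets(transactions, min_support):
    """Subproblem 1: all itemsets whose support reaches min_support."""
    items = sorted(set().union(*transactions))
    large = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            count = support(frozenset(combo), transactions)
            if count >= min_support:
                large[frozenset(combo)] = count
                found = True
        if not found:
            break  # no large itemset of this size, so none can be larger
    return large


def generate_rules(large, min_confidence):
    """Subproblem 2: rules lhs -> rhs taken from each large itemset."""
    rules = []
    for itemset, count in large.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                # Every subset of a large itemset is itself large.
                confidence = count / large[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


# Sessions S1..S4 used as transactions (illustrative).
transactions = [{0, 1, 3, 5}, {2, 3, 5, 6}, {1, 3, 5}, {3, 5}]
large = large_itemsets(transactions, min_support=2)
rules = generate_rules(large, min_confidence=0.9)
```

With these inputs, the itemset {3, 5} has support 4, and a rule such as {1} -> {3, 5} has confidence 2/2 = 1.0 because both transactions containing page 1 also contain pages 3 and 5.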

[Figure: example item sequence 0 1 2 0 1 3 6 7 3 5 6 8 0 4 11]
18
Pattern Discovery -- Sequential Pattern
Extract maximal sequences from transaction data
or web usage data.
19
Pattern Discovery Misc.
  • Zaïane et al. propose the use of On-Line
    Analytical Processing (OLAP) technology in web
    usage mining.
  • Hypertext Probabilistic Grammar (HPG) captures
    user web navigation patterns. User sessions are
    represented as an HPG whose higher-probability
    strings correspond to the navigation trails
    preferred by the user.
  • WUM uses an efficient data structure, the
    Aggregated Tree, to store the user sessions, and
    it also provides a query language to extract
    interesting patterns from the aggregated session
    data.
  • The Hy+ visualization system is used to visualize
    the portion of the World Wide Web explored during
    a browsing session. Hy+ graphically displays the
    history of the navigation and multiple views of
    the structure of that portion of the Web.
20
Application
  • Dynamic Hyperlink: generate dynamic links while
    watching the user browse. Cluster sessions with
    the Leader algorithm, match the observed session
    against all session clusters, generate
    recommendations from all matched clusters, and
    prune those pages that are linked with any page
    in the current session.
  • Personalization: the basic goal of a
    personalization system is to provide users with
    what they want without requiring them to ask for
    it explicitly, e.g., generate recommendations for
    each user while browsing.
  • Adaptive Web Site: change the web site according
    to the patterns extracted by web usage mining,
    e.g. collect all the pages on cheap trucks and
    put them in a new index page.
  • Web Site Improvement: compare the characteristics
    of non-customers and customers, and change the
    web site to push more non-customers to become
    customers, e.g. find the sequential patterns of
    non-customers and customers, and change the
    hyperlinks within some pages.
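
The Dynamic Hyperlink idea (Leader clustering, matching, pruning) could look roughly like the sketch below. The Jaccard similarity, the 0.5 threshold, the reading of "linked" as "directly linked from a page in the current session", and all names are assumptions. The sessions are the S1 to S4 example from the clustering slide, and the site links are made up.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of page ids (an assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_clusters(sessions, threshold):
    """Leader clustering: one pass, each session joins the first close leader."""
    clusters = []  # list of (leader_pages, member_sessions)
    for session in sessions:
        pages = set(session)
        for leader, members in clusters:
            if jaccard(pages, leader) >= threshold:
                members.append(pages)
                break
        else:
            clusters.append((pages, [pages]))  # session becomes a new leader
    return clusters


def recommend(observed, clusters, links, threshold):
    """Recommend pages from matched clusters, pruning already linked pages."""
    observed = set(observed)
    candidates = set()
    for leader, members in clusters:
        if jaccard(observed, leader) >= threshold:
            for member in members:
                candidates |= member
    candidates -= observed  # do not recommend pages already seen
    linked = set()
    for page in observed:
        linked |= set(links.get(page, ()))  # reachable by an existing link
    return sorted(candidates - linked)


past_sessions = [[0, 1, 3, 5], [2, 3, 5, 6], [1, 3, 5], [3, 5]]
clusters = leader_clusters(past_sessions, threshold=0.5)
site_links = {3: [5]}  # hypothetical: page 3 already links to page 5
suggestions = recommend([1, 3], clusters, site_links, threshold=0.5)
```

Here the observed session [1, 3] matches only the first cluster. Page 5 is pruned because the user can already reach it from page 3, leaving page 0 as the dynamic link.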

21
Outline
  • Motivation
  • Survey on Web Mining
  • My Research
22
Research Proposal
  • Pattern Discovery
  • Predict the Destination Pages
  • User Community
  • Recommendation Generation
  • Better Services -- help users to find what they
    most want very quickly.
  • Validation
  • Evaluate patterns in real world by real people.

23
Destination Prediction
Destination Pages are those pages that a user must
see in order to finish his/her task.
Direct URL
Search Engine
Bookmark
Wandering
Seeking
24
User Community Construction
For one web site, some of its users may have some
common characteristics, and thus users with
common interests can be grouped into a user
community. Take all the sessions of one user
into account while clustering, incorporated with
user information.
Possible User Communities: Faculty, Graduate Student
Users shown: Russ Greiner, Tingshao Zhu, Osmar
Zaiane, Peng Wang, Wei Zhou
25
Recommendation Generation
  • General recommendations over the whole site: when
    a new user comes to a web site, some
    recommendations are presented according to the
    patterns discovered from previous users.
  • Given the user's observed sequence, how can the
    most informative recommendations be generated?
  • Given all users' sessions, generate
    recommendations that a given user may never have
    visited.

26
Validation
  • Most researchers only list their mining results,
    without any statement on the efficiency of these
    patterns.
  • Calculate some measurements to compare different
    algorithms.
  • Controlled Web surfing, in the real world by real
    people (with Gerald Haeubl, School of Business).

27
Q & A