Title: Web Usage Mining for Internet Recommendation
1Web Usage Mining for Internet Recommendation
Tingshao Zhu tszhu_at_cs.ualberta.ca
2Outline
- Motivation
- Survey on Web Mining
- My Research
3Motivation
- The WWW holds more and more information, so too much time is spent surfing.
- The WWW evolves rapidly, yet some patterns remain unchanged.
- Zipf's law, e.g. (file size vs. number of requests; session length vs. number of sessions)
- Can we explore such patterns and provide better services?
4Research Overview
- Mining User Patterns
  - Predict Destination Pages for users; group users into communities; extract patterns for each community.
- Recommendation Generation
  - Generate recommendations (Destination Pages) based on the observed session.
- Validation
  - Apply in the real world with real people; get feedback for evaluation.
5Outline
- Motivation
- Survey on Web Mining
- My Research
6Web Mining
- Extraction of useful patterns from WWW resources
- Employs techniques from data mining, machine learning, information retrieval, etc.
7Web Content Mining
- An automatic process that extracts patterns from on-line information such as HTML files, images, or e-mails.
- Text Classification: given labeled training examples, train a classifier for on-line documents, e.g. assign a new web document to the Yahoo hierarchy.
- E-mail Threading: identify the thread of each incoming e-mail, and group all e-mails of the same thread together.
- Text Clustering: especially clustering web pages. Macskassy et al. have reported that even when clustering analysis can be carried out on web pages, little success is obtained.
8Web Structure Mining
Analysis of the link structure of the web. One of
its purposes is to identify preferable documents,
e.g., PageRank (Google) and HITS (hubs and
authorities). The intuition is that a hyperlink
from document A to document B implies that the
author of document A thinks document B contains
worthwhile information.
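The link-as-endorsement intuition can be illustrated with a minimal power-iteration sketch in the spirit of PageRank. The three-page link graph, the damping factor of 0.85, the iteration count, and the function name `pagerank` are assumptions made for this example; they do not come from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical site: A and C both link to B, and B links back to A.
toy_links = {"A": ["B"], "B": ["A"], "C": ["B"]}
ranks = pagerank(toy_links)
```

In this toy graph B collects endorsements from two pages and C from none, so B ends up ranked above A, and A above C.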
9Web Usage Mining
Application of data mining techniques to discover
usage patterns from Web data, in order to
understand and better serve the needs of
Web-based applications.
10(Diagram: stages of web usage mining, with their annotations)
- Web Log Files
- Preprocessing: process log data to be suitable for learning
- Pattern Discovery: clustering, classification, association rule, sequential pattern
- Evaluation
- Application: dynamic links, adaptive web site, web site evaluation
11Preprocessing
Original log entries are noisy and cannot be used
directly for pattern discovery; they are processed
to fit the learning algorithms.
12Preprocessing Data Filtering
- Status Code
  - 1xx Informational
  - 2xx Success
  - 3xx Redirection
  - 4xx Client Error
  - 5xx Server Error
- Automatic requests
  - Embedded image files, frame sets, etc.
- Robots
  - Frequently check /robots.txt; 8 requests for one month of CS.
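The three filtering criteria above can be sketched as follows. This is only an illustration: it assumes log entries have already been parsed into dicts with the hypothetical keys `host`, `url` and `status`, that only 2xx responses are kept, that the listed image suffixes are the ones to drop, and that any host requesting /robots.txt is a robot. None of these choices are stated in the slides.

```python
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".ico")  # assumed list


def find_robot_hosts(entries):
    """Heuristic: a host that requests /robots.txt is treated as a robot."""
    return {e["host"] for e in entries if e["url"] == "/robots.txt"}


def filter_entries(entries):
    """Keep successful page requests that do not come from robot hosts."""
    robots = find_robot_hosts(entries)
    kept = []
    for e in entries:
        if not 200 <= e["status"] < 300:
            continue  # drop informational, redirection and error responses
        if e["url"].lower().endswith(IMAGE_SUFFIXES):
            continue  # drop automatically requested embedded images
        if e["host"] in robots:
            continue  # drop every request made by a robot host
        kept.append(e)
    return kept


# Hypothetical parsed log entries.
sample = [
    {"host": "10.0.0.1", "url": "/index.html", "status": 200},
    {"host": "10.0.0.1", "url": "/logo.gif", "status": 200},
    {"host": "10.0.0.1", "url": "/missing.html", "status": 404},
    {"host": "10.0.0.2", "url": "/robots.txt", "status": 200},
    {"host": "10.0.0.2", "url": "/index.html", "status": 200},
]
clean = filter_entries(sample)
```

Only the first sample entry survives: the image, the 404 and both requests from the robot host are filtered out.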
13Preprocessing User Identification
- User identification is very important for session identification, especially for personalization.
- Problems
  - Dynamically allocated IP addresses.
  - Cache and proxy servers on the web.
  - Proxies bring more trouble.
- Possible Solutions
  - Cookies
  - Some heuristics: compare the OS and browser of users, use the web topology, etc.
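One of the heuristics listed above, comparing OS and browser, can be sketched by treating each distinct (IP address, user agent) pair as one user. The entry keys `host`, `agent` and `url` and the sample values are hypothetical, and the heuristic remains imperfect: two people behind the same proxy with identical browsers would still be merged.

```python
from collections import defaultdict


def identify_users(entries):
    """Group requested URLs by (IP address, user agent) as a stand-in for user identity."""
    users = defaultdict(list)
    for entry in entries:
        users[(entry["host"], entry["agent"])].append(entry["url"])
    return dict(users)


# Hypothetical entries: one IP address (e.g. a proxy) with two different browsers.
log = [
    {"host": "10.0.0.1", "agent": "Mozilla/4.0 (Windows)", "url": "/a.html"},
    {"host": "10.0.0.1", "agent": "Mozilla/4.0 (Linux)", "url": "/b.html"},
    {"host": "10.0.0.1", "agent": "Mozilla/4.0 (Windows)", "url": "/c.html"},
]
users = identify_users(log)
```

Here the single IP address is split into two presumed users because the user agents differ.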
14Preprocessing Session Identification
- Session: a group of user activities that are related to each other not only through an evolving information need but also through close proximity in time.
- Time out: no two consecutive requests are separated by an interval longer than a predefined threshold (mostly 25.5 or 30 minutes).
- Reference Length: a session ends with a content page. A content page is a page on which the user spends more time than a threshold.
- Maximal Forward Reference: the session runs up to the page before a backward reference is made.
- Time Window: the total time spent in the session is less than a threshold.
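Two of the heuristics above, time out and maximal forward reference, can be sketched as follows. The sketch assumes that one user's requests are available as (timestamp in seconds, URL) pairs and that a backward reference is a revisit of a page already on the current path; the function names and the sample data are made up for illustration.

```python
def sessions_by_timeout(requests, timeout=30 * 60):
    """Split one user's (timestamp_seconds, url) requests into sessions.

    A new session starts whenever the gap to the previous request
    exceeds `timeout` seconds.
    """
    sessions = []
    previous_time = None
    for timestamp, url in sorted(requests):
        if previous_time is None or timestamp - previous_time > timeout:
            sessions.append([])
        sessions[-1].append(url)
        previous_time = timestamp
    return sessions


def maximal_forward_references(path):
    """Split a click path into maximal forward references.

    When a page already on the current path is revisited, the path built
    so far is emitted (if the user was moving forward) and then cut back
    to the revisited page.
    """
    result = []
    current = []
    moving_forward = False
    for page in path:
        if page in current:
            if moving_forward:
                result.append(list(current))
            current = current[: current.index(page) + 1]
            moving_forward = False
        else:
            current.append(page)
            moving_forward = True
    if moving_forward and current:
        result.append(current)
    return result


# Hypothetical data: the third request comes 1801 seconds after the second.
requests = [(0, "/a"), (600, "/b"), (2401, "/c")]
timeout_sessions = sessions_by_timeout(requests)
forward_paths = maximal_forward_references(list("ABCDCBEGHGWAOUOV"))
```

With the default 30-minute threshold the three requests fall into two sessions, and the sixteen-click path is split into five forward paths.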
15Pattern Discovery -- Clustering
- Find clusters of users, pages, or sessions from web log files.
- User A
  - S1: 0, 1, 3, 5
  - S2: 2, 3, 5, 6
- User B
  - S3: 1, 3, 5
- User C
  - S4: 3, 5
- User Clustering: cluster users based on their behavior.
- Page Clustering: cluster pages according to the users' accesses to them.
- Session Clustering: find interesting session clusters. Each session cluster may correspond to one interesting topic within the web site.
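Any of these clusterings needs a notion of similarity. A minimal sketch, assuming sessions are compared as unordered page sets with the Jaccard coefficient (a choice made here for illustration, not one named in the slides), applied to the four example sessions:

```python
from itertools import combinations

# The example sessions S1..S4 from this slide, as sets of page IDs.
sessions = {
    "S1": {0, 1, 3, 5},
    "S2": {2, 3, 5, 6},
    "S3": {1, 3, 5},
    "S4": {3, 5},
}


def jaccard(a, b):
    """Shared pages divided by all pages seen in either session."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Pairwise similarity for every pair of sessions.
similarity = {
    (x, y): jaccard(sessions[x], sessions[y])
    for x, y in combinations(sorted(sessions), 2)
}
most_similar = max(similarity, key=similarity.get)
```

Under this measure S1 and S3 are the closest pair (three shared pages out of four), while S1 and S2 are the most distant.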
16Pattern Discovery -- Classification
- Page interest estimator based on some heuristics, such as
  - How often does the user visit the page?
  - Is the page included in the user's bookmarks?
  - How long has the user spent on the page?
  - Has the user visited the page recently?
  - How many links in the page are interesting to the user?
  - etc.
- Classify pages as interesting or not: take all the accessed pages as interesting pages and those never visited as non-interesting, extract features from each page, and train a classifier.
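A heuristic interest estimator of the kind listed above could look like the sketch below. Everything in it is an assumption for illustration: the `PageStats` fields, the caps (10 visits, 300 seconds, 7 days) and the equal weighting are invented, and the link-based heuristic is left out.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-user, per-page observations."""
    visits: int
    bookmarked: bool
    seconds_spent: float
    days_since_last_visit: float


def interest_score(stats):
    """Average of four heuristic signals, each scaled to the range 0..1."""
    frequency = min(stats.visits, 10) / 10            # how often visited (capped)
    bookmark = 1.0 if stats.bookmarked else 0.0       # in the user's bookmarks?
    dwell = min(stats.seconds_spent, 300.0) / 300.0   # time spent (capped)
    recency = 1.0 if stats.days_since_last_visit <= 7 else 0.0  # visited recently?
    return (frequency + bookmark + dwell + recency) / 4.0


favourite = PageStats(visits=10, bookmarked=True, seconds_spent=300, days_since_last_visit=1)
ignored = PageStats(visits=0, bookmarked=False, seconds_spent=0, days_since_last_visit=30)
scores = (interest_score(favourite), interest_score(ignored))
```

A trained classifier, as in the second bullet, would replace these hand-set weights with ones learned from visited and never-visited pages.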
17Pattern Discovery -- Association Rule
- The problem of discovering all association rules can be decomposed into two subproblems:
  - Find all sets of items (itemsets) whose transaction support is above the minimum support (large itemsets). The support of an itemset is the number of transactions that contain the itemset (independent of order).
  - Use the large itemsets to generate the desired rules. For every large itemset, find all nonempty subsets. For every such subset, output a rule if its confidence is higher than minConfidence.
(Figure residue, item IDs from an example: 0 1 2 0 1 3 6 7 3 5 6 8 0 4 11)
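The two subproblems can be sketched with a small level-wise search. The sketch treats both thresholds as inclusive (`>=`), which is one possible reading of the slide, and it reuses the four clustering sessions S1..S4 as transactions purely for illustration. The function names are made up, and no candidate pruning is attempted.

```python
from itertools import combinations


def large_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    large = {}
    while current:
        # Support = number of transactions containing the candidate, order ignored.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        large.update(frequent)
        # Join frequent k-itemsets into candidate (k+1)-itemsets.
        current = list({a | b for a in frequent for b in frequent
                        if len(a | b) == len(a) + 1})
    return large


def rules(large, min_confidence):
    """List (antecedent, consequent, confidence) rules from the large itemsets."""
    found = []
    for itemset, support in large.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                # confidence(X -> Y) = support(X and Y together) / support(X)
                confidence = support / large[antecedent]
                if confidence >= min_confidence:
                    found.append((antecedent, itemset - antecedent, confidence))
    return found


transactions = [{0, 1, 3, 5}, {2, 3, 5, 6}, {1, 3, 5}, {3, 5}]
large = large_itemsets(transactions, min_support=2)
strong_rules = rules(large, min_confidence=1.0)
```

With a minimum support of 2, the itemset {1, 3, 5} is large, and the rule {1} -> {3, 5} has confidence 1.0 because both sessions containing page 1 also contain pages 3 and 5.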
18Pattern Discovery -- Sequential Pattern
Extract maximal sequences from transaction data
or web usage data.
19Pattern Discovery Misc.
- Zaïane et al. propose the use of On-Line Analytical Processing (OLAP) technology in web usage mining.
- Hypertext Probabilistic Grammar (HPG) captures user web navigation patterns: user sessions are represented as a hypertext probabilistic grammar whose higher-probability strings correspond to the navigation trails preferred by the user.
- WUM uses an efficient data structure, the aggregated tree, to store user sessions, and it also provides a query language to extract interesting patterns from the aggregated session data.
- The Hy+ visualization system is used to visualize the portion of the World Wide Web explored during a browsing session. Hy+ graphically displays the history of the navigation and multiple views of the structure of that portion of the Web.
20Application
- Dynamic Hyperlink: generate dynamic links while watching the user browse. Cluster sessions with the Leader algorithm, match the observed session against all session clusters, generate recommendations from all matched clusters, and prune those pages that have links with any page in the current session.
- Personalization: the basic goal of a personalization system is to provide users with what they want without requiring them to ask for it explicitly, e.g., generate recommendations for each user while browsing.
- Adaptive Web Site: change the web site according to the patterns extracted by web usage mining, e.g. collect all the pages on cheap trucks and put them in a new index page.
- Web Site Improvement: compare the characteristics of non-customers and customers, and change the web site to turn more non-customers into customers, e.g. find the sequential patterns of non-customers and customers, and change the hyperlinks within some pages.
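The Dynamic Hyperlink pipeline described in the first bullet can be sketched as below. The assumptions are: sessions are page sets compared with the Jaccard coefficient, the similarity threshold is 0.5, and "have links with" is read as "are directly linked from a page in the current session". The `site_links` map and all function names are hypothetical.

```python
def jaccard(a, b):
    """Shared pages divided by all pages seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_clusters(sessions, threshold=0.5):
    """Single-pass Leader clustering.

    Each session joins the first cluster whose leader is similar enough;
    otherwise it becomes the leader of a new cluster.
    """
    clusters = []  # list of (leader_pages, member_sessions)
    for session in sessions:
        pages = set(session)
        for leader, members in clusters:
            if jaccard(leader, pages) >= threshold:
                members.append(pages)
                break
        else:
            clusters.append((pages, [pages]))
    return clusters


def recommend(observed, clusters, site_links, threshold=0.5):
    """Pages from matching clusters, minus visited and directly linked pages."""
    observed = set(observed)
    candidates = set()
    for leader, members in clusters:
        if jaccard(leader, observed) >= threshold:
            candidates.update(*members)
    # Prune pages the user can already reach with one click.
    linked = {t for page in observed for t in site_links.get(page, ())}
    return candidates - observed - linked


past_sessions = [[0, 1, 3, 5], [2, 3, 5, 6], [1, 3, 5], [3, 5]]
clusters = leader_clusters(past_sessions)
site_links = {3: [5]}  # hypothetical: page 3 links to page 5
suggestions = recommend([1, 3], clusters, site_links)
```

For a user who has viewed pages 1 and 3, only the first cluster matches; page 5 is pruned because page 3 already links to it, which leaves page 0 as the recommendation.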
21Outline
- Motivation
- Survey on Web Mining
- My Research
22Research Proposal
- Pattern Discovery
  - Predict the Destination Pages
  - User Community
- Recommendation Generation
  - Better Services: help users find what they most want very quickly.
- Validation
  - Evaluate patterns in the real world with real people.
23Destination Prediction
Destination Pages are the pages that a user must
see in order to finish his/her task.
(Diagram labels: Direct URL, Search Engine, Bookmark, Wandering, Seeking)
24User Community Construction
For one web site, some of its users may have
common characteristics, and thus users with
common interests can be grouped into a user
community. All the sessions of one user are taken
into account while clustering, together with user
information.
Possible User Communities
- Faculty: Russ Greiner, Osmar Zaiane
- Graduate Student: Tingshao Zhu, Peng Wang, Wei Zhou
25Recommendation Generation
- General recommendations over the whole site: when a new user comes to a web site, some recommendations are presented according to the patterns discovered from previous users.
- Given the user's observed sequence, how can the most informative recommendations be generated?
- Given all users' sessions, generate recommendations for pages that a given user may never have visited.
26Validation
- Most researchers only list their mining results, without any statement on how effective these patterns are.
- Calculate some measurements to compare different algorithms.
- Controlled web surfing, in the real world by real people (with Gerald Haeubl, School of Business).
27Q & A