Title: Introduction to Web Mining and Web Usage Mining
1 Introduction to Web Mining and Web Usage Mining
- Course: Usability of Interactive Applications
- Year: 2007
- Lecturer: Federico M. Facca (facca_at_elet.polimi.ti)
- Main Lecturer: Francesca Rizzo (rizzo_at_elet.polimi.it)
2 Agenda
- Web Mining
  - Introduction
  - Web Content Mining
  - Web Structure Mining
- Web Usage Mining
  - Introduction
  - Algorithms
  - Applications
  - Examples
- References
3 Web Mining: Introduction
- Web Mining
  - the application of data mining techniques to discover patterns from the Web.
- Data Mining
  - also called Knowledge Discovery in Databases (KDD): the process of automatically searching large volumes of data for patterns (extracting knowledge from data).
4 Web Mining: Introduction
- Web Content Mining
  - discovers useful information from the content of web pages. Web content may consist of text, image, audio or video data.
- Web Structure Mining
  - uses graph theory to analyse the node and connection structure of a web site.
- Web Usage Mining
  - analyses and discovers interesting patterns in users' usage data on the web. The usage data records the user's behaviour when the user browses or makes transactions on the web site.
5 Web Mining: Introduction
[Diagram: phases leading from Web data to Knowledge (Information Retrieval, Information Extraction, Generalization, Analysis), annotated for WCM, WSM and WUM with the labels Wrapper, Indexer, Crawler/Spider, Categorization, Clustering, Ranker and Characterizing]
- Depending on the Web Mining category and on the objective, the different phases take on a different role and importance.
6 Web Mining: Web Content Mining
- Discovery of useful information from web contents / data / documents
- Web data contents: text, image, audio, video, metadata and hyperlinks.
- Information Retrieval View (Structured and Semi-Structured)
  - Assist / improve information finding
  - Filter information for users based on user profiles
- Database View
  - Model data on the web
  - Integrate them for more sophisticated queries
7 Web Mining: Web Content Mining
- Developing intelligent tools for IR
- Finding keywords and key phrases
- Discovering grammatical rules and collocations
- Hypertext classification/categorization
- Extracting key phrases from text documents
- Learning extraction models/rules
- Hierarchical clustering
- Predicting (word) relationships
8 Web Mining: Web Structure Mining
- Discovering the link structure of the hyperlinks at the inter-document level, to generate a structural summary of the Web site and its Web pages.
- Categorizing the Web pages and generating information based on the hyperlinks.
- Discovering the structure of the Web document itself.
- Discovering the nature of the hierarchy or network of hyperlinks in the Web site of a particular domain.
9 Web Mining: Web Structure Mining
- Finding authoritative Web pages
  - Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic
- Hyperlinks can be used to infer the notion of authority
  - The Web consists not only of pages, but also of hyperlinks pointing from one page to another
  - These hyperlinks contain an enormous amount of latent human annotation
  - A hyperlink pointing to another Web page can be considered the author's endorsement of that page
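The endorsement idea can be illustrated with the crudest possible authority measure: counting in-links. This is only a sketch, not the method used by any particular link-analysis algorithm, and the link graph and page names below are hypothetical.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
links = {
    "index.html": ["products.html", "about.html"],
    "blog.html": ["products.html"],
    "about.html": ["products.html", "index.html"],
    "products.html": ["index.html"],
}

def inlink_counts(graph):
    """Count, for every page, how many distinct other pages link to it."""
    counts = Counter({page: 0 for page in graph})
    for source, targets in graph.items():
        for target in set(targets):
            if target != source:  # a self-link is not an endorsement
                counts[target] += 1
    return counts

scores = inlink_counts(links)
# Most endorsed page first; ties are broken alphabetically.
ranking = sorted(scores, key=lambda page: (-scores[page], page))
print(ranking)
```

Real link-analysis methods also weigh who is endorsing, which a plain count ignores.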
10 Web Usage Mining: Introduction
- Also known as web log mining
- Not only statistical measures
- Not only server logs
- Can be organized along 3 orthogonal dimensions
  - Techniques
    - Statistical Analysis
    - Association Rules
    - Clustering
    - Sequential Patterns
    - Rough Sets
    - Fuzzy Logic
  - Visualization
    - Graphs
    - Relational Tables
    - OLAP
    - Query languages
  - Applications
    - Personalization
    - Usability Testing
    - User modeling
    - Marketing
    - Adaptive Web sites
11 Web Usage Mining: Terms
- User
  - The principal using a client to interactively retrieve and render resources or resource manifestations.
- Page view
  - Visual rendering of a Web page in a specific client environment at a specific point in time.
- Click stream
  - A sequential series of page view requests.
- User session
  - A delimited set of user clicks (click stream) across one or more Web servers.
- Server session (visit)
  - A collection of user clicks to a single Web server during a user session.
- Episode
  - A subset of related user clicks that occur within a user session.
12 Web Usage Mining: Applications
- Target potential customers for electronic commerce
- Enhance the quality and delivery of Internet information services to the end user
- Improve Web server system performance
- Identify potential prime advertisement locations
- Facilitate personalization/adaptive sites
- Improve site design
- Fraud/intrusion detection
- Predict users' actions (allows prefetching)
13 Web Usage Mining: Information Retrieval
- The information is usually easy to obtain (web log, cookies, proxy log, database log...).
- Information can be obtained from server, client and proxy.
14 Web Usage Mining: Information Extraction
- Completing missing information using some heuristics
- Identification of sessions/episodes
- Mining and conversion of contents to the processing format (WCM)
- Mining of the web site structure (WSM)
- Finding and removing data distortion (e.g. crawler sessions).
- Representing the information in the correct format for the pattern discovery task
15 Web Usage Mining: Generalization
- Usage patterns
  - Navigation patterns
  - Behaviour patterns
  - Access patterns
- Techniques
  - Association rules (e.g. 45% of users that visited products/product1.html also visited products/productX.html).
  - Clustering (identifying groups of users that show similar sessions)
  - Classification (e.g. 30% of users that bought products from the category Music are between 18 and 25 years old and live in northern Europe)
  - Sequential patterns (e.g. 15% of users that bought a product in the category Music placed a new order in the category Book a week later)
16 Web Usage Mining: Analysis
- Removing patterns that do not provide new knowledge
- Visualization of acquired knowledge
- Usage of discovered patterns for
  - Categorizing users
  - Personalizing contents/advertisements
  - Dynamically modifying the web site structure
  - Marketing
  - Improving application usability
17 Web Usage Mining: Problems with Web Logs
- Identifying users
  - Clients may have multiple streams
  - Clients may access the web from multiple hosts
  - Proxy servers: many clients/one address
  - Proxy servers: one client/many addresses
- Data not in log
  - POST data (e.g., CGI requests) not recorded
  - Cookie data stored elsewhere
18 Web Usage Mining: Problems with Web Logs
- Missing data
  - Pages may be cached
  - Referring page requires client cooperation
- When does a session end?
  - Use of forward and backward pointers
  - Typically a 30 minute timeout is used
- Web content may be dynamic
  - May not be able to reconstruct what the user saw
- Use of spiders and automated agents, which automatically request web pages
19 Web Usage Mining: Problems with Web Logs
- Like most data mining tasks, web log mining requires preprocessing
  - To identify users
  - To match sessions to other data
  - To fill in missing data
  - Essentially, to reconstruct the click stream
20 Web Usage Mining: Web Server Logs
- Web servers have the ability to log all requests
- Web server log formats
  - Common Log Format (CLF)
  - Extended Log Format: allows configuration of the log file
- Logs generate vast amounts of data
21 Web Usage Mining: Web Server Logs
- Common Log Format
  - Remotehost: browser hostname or IP address
  - Remote log name of the user (almost always "-", meaning "unknown")
  - Authuser: authenticated username
  - Date: date and time of the request
  - "request": exact request line from the client
  - Status: the HTTP status code returned
  - Bytes: the content-length of the response
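The field list above can be turned into a small parser. The following is a minimal sketch, assuming the usual space-separated CLF layout with the date in square brackets and the request in double quotes; the sample line, the host address and the function name `parse_clf` are made up for illustration.

```python
import re
from datetime import datetime

# One named group per Common Log Format field, in log order.
CLF_PATTERN = re.compile(
    r'^(?P<remotehost>\S+) (?P<logname>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)

def parse_clf(line):
    """Parse one CLF line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the bytes field means that no content was returned.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

# Hypothetical log line.
sample = ('192.0.2.1 - - [10/Oct/2007:13:55:36 +0200] '
          '"GET /products/product1.html HTTP/1.0" 200 2326')
entry = parse_clf(sample)
print(entry["remotehost"], entry["request"], entry["status"], entry["bytes"])
```

Returning `None` for non-matching lines lets a caller count and inspect malformed entries instead of aborting on the first one.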
22 Web Usage Mining: Web Server Logs
23 Web Usage Mining: Pre-Processing
- Data Cleaning
  - Removes log entries that are not needed for the mining process
- Data Integration
  - Synchronizes data from multiple server logs and metadata
- User Identification
  - Associates page references with different users
- Session/Episode Identification
  - Groups users' page references into user sessions
- Page View Identification
- Path Completion
  - Fills in page references missing due to browser and proxy caching
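As an illustration of the data cleaning step, the sketch below drops failed requests and requests for embedded resources. Which entries count as "not needed" depends on the mining task; the suffix list, the status rule and the sample entries are assumptions made for this example.

```python
# Assumed list of resource types that do not represent page views.
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")

def is_relevant(request, status):
    """Keep successful requests whose path is not an embedded resource."""
    parts = request.split()
    if len(parts) < 2:
        return False  # malformed request line
    path = parts[1].split("?", 1)[0].lower()  # ignore the query string
    return 200 <= status < 300 and not path.endswith(IGNORED_SUFFIXES)

# Hypothetical (request, status) pairs taken from parsed log entries.
entries = [
    ("GET /index.html HTTP/1.0", 200),
    ("GET /logo.gif HTTP/1.0", 200),
    ("GET /missing.html HTTP/1.0", 404),
    ("GET /products/product1.html?ref=home HTTP/1.0", 200),
]
cleaned = [e for e in entries if is_relevant(*e)]
print(cleaned)
```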
24 Web Usage Mining: Pre-Processing
- A single IP address is used by many users
- Different IP addresses in a single session
- Missing cache hits in the server logs
[Diagram labels: different users, proxy server, Web server; single user, ISP server, Web server]
25 Web Usage Mining: Pre-Processing
- Remote Agent
  - A remote agent is implemented as a Java applet
  - It is loaded into the client only once, when the first page is accessed
  - Subsequent requests are captured and sent back to the server
- Modified Browser
  - The source code of an existing browser can be modified to collect user-specific data on the client side
- Dynamic page rewriting
  - When the user first submits a request, the server returns the requested page rewritten to include a session-specific ID
  - Each subsequent request supplies this ID to the server
- Heuristics
  - Use a set of assumptions to identify user sessions and find the missing cache hits in the server log
26 Web Usage Mining: Session identification heuristics
- Timeout
  - If the time between page requests exceeds a certain limit, it is assumed that the user is starting a new session
- IP/Agent
  - Each different agent type for an IP address represents a different session
- Referring page
  - If the referring page for a request is not part of an open session, it is assumed that the request is coming from a different session
- Same IP-Agent/different sessions (Closest)
  - Assigns the request to the session that is closest to the referring page at the time of the request
- Same IP-Agent/different sessions (Recent)
  - In the case where multiple sessions are the same distance from a page request, assigns the request to the session with the most recent referrer access in terms of time
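The first two heuristics (Timeout and IP/Agent) can be combined in a few lines. This is a minimal sketch rather than a complete sessionizer: it ignores the referrer-based heuristics, uses the 30 minute timeout mentioned earlier, and works on hypothetical `(ip, agent, timestamp, url)` tuples.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)

def sessionize(requests, timeout=TIMEOUT):
    """Group (ip, agent, timestamp, url) requests into sessions.

    Each distinct (ip, agent) pair is treated as a separate client, and a
    client's session is closed when two consecutive requests are more than
    `timeout` apart.
    """
    by_client = defaultdict(list)
    for ip, agent, ts, url in sorted(requests, key=lambda r: r[2]):
        by_client[(ip, agent)].append((ts, url))

    sessions = []
    for client, hits in by_client.items():
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            if hit[0] - prev[0] > timeout:  # gap too long: start a new session
                sessions.append((client, current))
                current = []
            current.append(hit)
        sessions.append((client, current))
    return sessions

# Hypothetical requests: one IP address, two agents, one long pause.
t0 = datetime(2007, 1, 1, 10, 0)
requests = [
    ("10.0.0.1", "Mozilla", t0, "/index.html"),
    ("10.0.0.1", "Mozilla", t0 + timedelta(minutes=5), "/products.html"),
    ("10.0.0.1", "Opera", t0 + timedelta(minutes=6), "/index.html"),
    ("10.0.0.1", "Mozilla", t0 + timedelta(minutes=50), "/index.html"),
]
sessions = sessionize(requests)
for client, hits in sessions:
    print(client, [url for _, url in hits])
```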
27-31 Web Usage Mining: Sessionization Example [figures not reproduced]
32 Web Usage Mining: Association rule mining
- Proposed by Agrawal et al. in 1993.
- It is an important data mining model studied extensively by the database and data mining community.
- Assumes all data are categorical.
- No good algorithm for numeric data.
- Initially used for Market Basket Analysis to find how items purchased by customers are related.
- Example: Url1 → Url4 [sup = 5%, conf = 100%]
33 Web Usage Mining: Association rule mining
- A set of items
  - $I = \{i_1, i_2, \ldots, i_m\}$
- Transaction t
  - t is a set of items, and $t \subseteq I$
- Transaction Database T
  - a set of transactions $T = \{t_1, t_2, \ldots, t_n\}$
34 Web Usage Mining: Association rule mining
- A transaction t contains X, a set of items (itemset) in I, if $X \subseteq t$.
- An association rule is an implication of the form
  - $X \rightarrow Y$, where $X, Y \subset I$ and $X \cap Y = \emptyset$
- An itemset is a set of items.
  - E.g., X = {url1, url2, url3} is an itemset.
- A k-itemset is an itemset with k items.
  - E.g., {url1, url2, url3} is a 3-itemset
35 Web Usage Mining: Association rule mining
- Support
  - The rule holds with support sup in T (the transaction data set) if sup% of transactions contain $X \cup Y$.
  - $sup = \Pr(X \cup Y)$
- Confidence
  - The rule holds in T with confidence conf if conf% of transactions that contain X also contain Y.
  - $conf = \Pr(Y \mid X)$
- An association rule is a pattern that states that when X occurs, Y occurs with a certain probability.
36 Web Usage Mining: Association rule mining
- Transaction data
  - t1: Url1, Url2, Url4
  - t2: Url1, Url3
  - t3: Url3, Url5
  - t4: Url1, Url2, Url3
  - t5: Url1, Url2, Url6, Url3, Url4
  - t6: Url2, Url6, Url4
  - t7: Url2, Url4, Url6
- Assume
  - minsup = 30%
  - minconf = 80%
- An example frequent itemset
  - {Url2, Url6, Url4} [sup = 3/7]
- Association rules from the itemset
  - Url6 → Url4, Url2 [sup = 3/7, conf = 3/3]
  - …
  - Url6, Url2 → Url4 [sup = 3/7, conf = 3/3]
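The support and confidence figures of this example can be checked directly from the seven transactions. The sketch below uses exact fractions so that the results read like the slide; the function names are illustrative.

```python
from fractions import Fraction

# The seven transactions t1..t7 from the example.
transactions = [
    {"Url1", "Url2", "Url4"},
    {"Url1", "Url3"},
    {"Url3", "Url5"},
    {"Url1", "Url2", "Url3"},
    {"Url1", "Url2", "Url6", "Url3", "Url4"},
    {"Url2", "Url6", "Url4"},
    {"Url2", "Url4", "Url6"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(lhs, rhs, transactions):
    """Pr(rhs | lhs); assumes `lhs` occurs in at least one transaction."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"Url2", "Url6", "Url4"}, transactions))         # 3/7
print(confidence({"Url6"}, {"Url4", "Url2"}, transactions))    # 1
print(confidence({"Url6", "Url2"}, {"Url4"}, transactions))    # 1
```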
37 Web Usage Mining: Association rule mining
Apriori Algorithm
- Probably the best known algorithm
- Two steps
  - Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).
  - Use frequent itemsets to generate rules.
- E.g., a frequent itemset
  - {Url2, Url6, Url4} [sup = 3/7]
- and one rule from the frequent itemset
  - Url6 → Url4, Url2 [sup = 3/7, conf = 3/3]
38 Web Usage Mining: Association rule mining
Apriori Algorithm
- Iterative algorithm
  - Find all 1-item frequent itemsets, then all 2-item frequent itemsets, and so on.
  - In each iteration k, only consider itemsets that contain some frequent (k-1)-itemset.
- Find frequent itemsets of size 1: $F_1$
- From k = 2
  - $C_k$ = candidates of size k: those itemsets of size k that could be frequent, given $F_{k-1}$
  - $F_k$ = those itemsets that are actually frequent, $F_k \subseteq C_k$ (need to scan the database once).
39 Web Usage Mining: Association rule mining
Apriori Algorithm
Dataset T, minsup = 0.5 (notation: itemset:count)
1. scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   → F1: {1}:2, {2}:3, {3}:3, {5}:3
   → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   → C3: {2,3,5}
3. scan T → C3: {2,3,5}:2
   → F3: {2,3,5}
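The level-wise search traced above can be sketched as follows. Note that the dataset T itself is not reproduced on the slide, so the four transactions below are an assumption, chosen so that every count matches the trace; the join and prune steps are a plain, unoptimized rendering of candidate generation.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: count} for all itemsets with relative support >= minsup."""
    n = len(transactions)

    def count(candidates):
        # One pass over the data per level.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    def frequent(counts):
        return {c: k for c, k in counts.items() if k / n >= minsup}

    items = {item for t in transactions for item in t}
    level = frequent(count({frozenset([i]) for i in items}))  # F1
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent(count(candidates))  # Fk
        result.update(level)
        k += 1
    return result

# Assumed dataset, consistent with the counts in the trace above.
T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
freq = apriori(T, minsup=0.5)
for itemset, c in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), c)
```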
40 Web Usage Mining: Association rule mining
Apriori Algorithm
- Frequent itemsets → association rules
- One more step is needed to generate association rules
- For each frequent itemset X,
  - For each proper nonempty subset A of X,
    - Let B = X - A
    - $A \rightarrow B$ is an association rule if
      - $confidence(A \rightarrow B) \geq minconf$,
      - $support(A \rightarrow B) = support(A \cup B) = support(X)$
      - $confidence(A \rightarrow B) = support(A \cup B) / support(A)$
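The rule generation step needs only the support counts of the frequent itemsets. The sketch below takes the counts listed in the Apriori example (4 transactions, minsup = 0.5); the `minconf` value of 0.8 is an assumption borrowed from the earlier URL example.

```python
from itertools import combinations

# Support counts of the frequent itemsets F1, F2 and F3 from the Apriori example.
counts = {
    frozenset({1}): 2, frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({1, 3}): 2, frozenset({2, 3}): 2,
    frozenset({2, 5}): 3, frozenset({3, 5}): 2,
    frozenset({2, 3, 5}): 2,
}

def generate_rules(counts, minconf):
    """Return (A, B, confidence) for every rule A -> B with confidence >= minconf."""
    rules = []
    for itemset, count_x in counts.items():
        for size in range(1, len(itemset)):  # proper nonempty subsets only
            for a in map(frozenset, combinations(sorted(itemset), size)):
                # Every subset of a frequent itemset is frequent, so counts[a] exists.
                conf = count_x / counts[a]
                if conf >= minconf:
                    rules.append((set(a), set(itemset - a), conf))
    return rules

rules = generate_rules(counts, minconf=0.8)  # assumed threshold
for a, b, conf in rules:
    print(sorted(a), "->", sorted(b), conf)
```

No further scan of the database is needed here, because the counts were already collected while finding the frequent itemsets.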
41 Web Usage Mining: Sequential pattern mining
- Association rules concern which items appear together (at the same time)
  - Intra-transaction patterns
- Sequential patterns concern which items appear at different times
  - Inter-transaction patterns
42 Web Usage Mining: Sequential pattern mining
- Itemset
  - Non-empty set of items. Each itemset is mapped to an integer.
- Sequence
  - Ordered list of itemsets.
- Customer Sequence
  - List of customer transactions ordered by increasing transaction time.
  - A customer supports a sequence if the sequence is contained in the customer sequence.
- Support for a Sequence
  - Fraction of total customers that support a sequence.
- Maximal Sequence
  - A sequence that is not contained in any other sequence.
- Large Sequence
  - Sequence that meets minsup.
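The containment and support definitions above can be made concrete with a short sketch. It assumes the usual reading of "contained": each itemset of the pattern must be a subset of a distinct element of the customer sequence, in the same order. The customer sequences and page names are hypothetical.

```python
def contains(customer_sequence, pattern):
    """True if the itemsets of `pattern` occur, in order, as subsets of
    distinct elements of `customer_sequence`."""
    pos = 0
    for itemset in pattern:
        # Greedily move to the earliest element that includes this itemset.
        while pos < len(customer_sequence) and not itemset <= customer_sequence[pos]:
            pos += 1
        if pos == len(customer_sequence):
            return False
        pos += 1  # the next itemset must match a later element
    return True

def sequence_support(customers, pattern):
    """Fraction of customers whose sequence contains `pattern`."""
    return sum(contains(c, pattern) for c in customers) / len(customers)

# Hypothetical customer sequences (one list of itemsets per customer).
customers = [
    [{"home"}, {"music"}, {"book"}],
    [{"home"}, {"music", "cart"}, {"book"}],
    [{"music"}, {"home"}],
    [{"book"}, {"music"}],
]
pattern = [{"music"}, {"book"}]
print(sequence_support(customers, pattern))
```

With a minsup of 0.5, this pattern would count as a large sequence in the toy data.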
43 Web Usage Mining: Sequential pattern mining
44 Web Usage Mining: Sequential pattern mining
PrefixSpan algorithm
- <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of the sequence <a(abc)(ac)d(cf)>
- Given the sequence <a(abc)(ac)d(cf)>
45 Web Usage Mining: Sequential pattern mining
PrefixSpan algorithm
- Step 1: find length-1 sequential patterns
  - <a>, <b>, <c>, <d>, <e>, <f>
- Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets
  - The ones having prefix <a>
  - The ones having prefix <b>
  - …
  - The ones having prefix <f>
46 Web Usage Mining: Sequential pattern mining
PrefixSpan algorithm
- Only need to consider projections w.r.t. <a>
  - <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
- Find all the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  - Further partition into 6 subsets
    - Having prefix <aa>
    - …
    - Having prefix <af>
47 Web Usage Mining: Sequential pattern mining
PrefixSpan algorithm
[Diagram: recursive partitioning of the search space]
- SDB
  - Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  - Having prefix <a>
    - <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
    - Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      - Having prefix <aa>: <aa>-projected database
      - …
      - Having prefix <af>: <af>-projected database
  - Having prefix <b>
    - <b>-projected database
    - …
  - Having prefix <c>, …, <f>
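The projection step can be sketched for a single-item prefix. This is not a full PrefixSpan implementation: it only builds the projected database for one item on unprojected sequences. The four input sequences are an assumption, since the slide showing the sequence database did not survive; they were chosen so that their <a>-projections match the projected database listed above (only the first one is quoted in the text).

```python
def project(sequence, item):
    """Suffix of `sequence` after the first occurrence of `item`, or None.

    Items that follow `item` inside the same element are kept as a partial
    element marked with "_", e.g. (ad) projected on a gives (_d).
    """
    for pos, itemset in enumerate(sequence):
        if item in itemset:
            rest = tuple(x for x in itemset if x > item)  # elements are sorted
            head = [("_",) + rest] if rest else []
            return head + list(sequence[pos + 1:])
    return None

def show(sequence):
    """Render a sequence in the <a(abc)...> notation."""
    parts = []
    for itemset in sequence:
        text = "".join(itemset)
        parts.append(text if len(itemset) == 1 else "(" + text + ")")
    return "<" + "".join(parts) + ">"

# Assumed sequence database (each element is a sorted tuple of items).
sdb = [
    [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")],
    [("a", "d"), ("c",), ("b", "c"), ("a", "e")],
    [("e", "f"), ("a", "b"), ("d", "f"), ("c",), ("b",)],
    [("e",), ("g",), ("a", "f"), ("c",), ("b",), ("c",)],
]
projected = [project(s, "a") for s in sdb]
print([show(p) for p in projected if p is not None])
```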
48 Web Usage Mining: Clustering
- Clustering is a technique for finding similarity groups in data, called clusters. I.e.,
  - it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different (far away) from each other into different clusters.
- Clustering is often called an unsupervised learning task, as no class values denoting an a priori grouping of the data instances are given, as they are in supervised learning.
- For historical reasons, clustering is often considered synonymous with unsupervised learning.
  - In fact, association rule mining is also unsupervised
49 Web Usage Mining: Clustering
- The data set has three natural groups of data points, i.e., 3 natural clusters.
50 Web Usage Mining: Clustering
- Let us see some real-life examples
- Example 1: group people of similar sizes together to make small, medium and large T-shirts.
  - Tailor-made for each person: too expensive
  - One-size-fits-all: does not fit all.
- Example 2: in e-commerce, segment customers according to their similarities
  - To do targeted marketing.
51 Web Usage Mining: Clustering
- A clustering algorithm
  - Partitional clustering
  - Hierarchical clustering
  - …
- A distance (similarity, or dissimilarity) function
- Clustering quality
  - Inter-cluster distance → maximized
  - Intra-cluster distance → minimized
- The quality of a clustering result depends on the algorithm, the distance function, and the application.
52 Web Usage Mining: Clustering - K-means algorithm
- K-means is a partitional clustering algorithm
- Let the set of data points (or instances) D be
  - $\{x_1, x_2, \ldots, x_n\}$,
  - where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ir})$ is a vector in a real-valued space $X \subseteq \mathbb{R}^r$, and r is the number of attributes (dimensions) in the data.
- The k-means algorithm partitions the given data into k clusters.
  - Each cluster has a cluster center, called the centroid.
  - k is specified by the user
53 Web Usage Mining: Clustering - K-means algorithm
- Given k, the k-means algorithm works as follows
  1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers)
  2) Assign each data point to the closest centroid
  3) Re-compute the centroids using the current cluster memberships.
  4) If a convergence criterion is not met, go to 2)
54 Web Usage Mining: Clustering - K-means algorithm
- Convergence criteria
  - no (or minimum) re-assignments of data points to different clusters,
  - no (or minimum) change of centroids, or
  - minimum decrease in the sum of squared error (SSE),
    $SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2$
  - $C_j$ is the jth cluster, $m_j$ is the centroid of cluster $C_j$ (the mean vector of all the data points in $C_j$), and $dist(x, m_j)$ is the distance between data point x and centroid $m_j$.
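The four steps and the SSE measure can be put together in a small pure-Python sketch. It assumes Euclidean distance and stops when no point changes cluster; the `init` parameter (explicit starting centroids) and the six sample points are additions for illustration, so that the run is reproducible.

```python
import math
import random

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(points):
    """Mean vector (centroid) of a non-empty list of points."""
    return tuple(sum(column) / len(points) for column in zip(*points))

def sse(clusters, centroids):
    """Sum of squared distances of every point to its cluster centroid."""
    return sum(dist(x, m) ** 2 for cluster, m in zip(clusters, centroids) for x in cluster)

def kmeans(points, k, init=None, seed=0, max_iter=100):
    # Step 1: choose k data points as initial centroids (or use `init`).
    centroids = list(init) if init is not None else random.Random(seed).sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centroid.
        new_assignment = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # Step 4: stop when no point is re-assigned.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        clusters = [[p for p, a in zip(points, assignment) if a == j] for j in range(k)]
        # Step 3: re-compute centroids; an empty cluster keeps its old centroid.
        centroids = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
    return clusters, centroids

# Hypothetical 2-D data with two well separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
clusters, centroids = kmeans(points, k=2, init=[(1, 1), (8, 8)])
print(clusters)
print(centroids, sse(clusters, centroids))
```

Different seeds can lead to different results on less clean data, which is why implementations usually run the algorithm several times and keep the clustering with the lowest SSE.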
55-61 Web Usage Mining: Clustering - K-means algorithm [worked example, figures not reproduced]
- Select K and, accordingly, K centers in the space
- Assign points to the nearest center
- Recompute the new center for each cluster
- Assign points to the nearest center
- Three points change cluster
- Recompute the new center for each cluster
- Assign points to the nearest center. No change! STOP
62 Web Usage Mining: Applications
- User characterization
  - Creation of user classes according to navigation behaviours and visited contents.
  - Basic step for many of the other WUM applications
- Personalization
  - Attracting users with advanced personalized features (content, presentation, navigation).
  - Recommender systems based on user profiles and mined behaviours
  - Ad hoc advertising
63 Web Usage Mining: Applications
- Web application improvement
  - Performance: prefetching, load balancing, web caching, based on user behaviours
  - Security: finding intrusions and frauds.
  - Usability: adapting the model of the web application to the model expected by users
- Marketing
  - Information on users is very important for e-commerce web sites.
  - It is possible to obtain data on
    - Customer acquisition
    - Customer retention
    - Cross sales
    - Customer loss
64 References
- R. Kosala and H. Blockeel. Web mining research: a survey. SIGKDD Explorations, ACM, 2(1):1-15, 2000.
- Sankar Pal, Varun Talwar, and Pabitra Mitra. Web mining in soft computing framework: Relevance, state of the art and future directions, 2002.
- Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, ACM, 1(2):12-23, 2000.
- Federico Michele Facca and Pier Luca Lanzi. Mining interesting knowledge from weblogs: a survey. Data Knowl. Eng., 53(3):225-241, 2005.