1
Web Usage Mining Modelling: frequent-pattern
mining I (sequence mining with WUM),
classification and clustering
  • Prof. Dr. Bettina Berendt
  • Humboldt Univ. Berlin, Germany
  • www.berendt.de

2
Please note
  • These slides use and/or refer to a lot of
    material available on the Internet. To reduce
    clutter, credits and hyperlinks are given in the
    following ways:
  • Slides adapted from other people's materials: at
    bottom of slide
  • Pictures, screenshots etc.: URL visible in
    screenshot or given in PPT Comments field
  • Literature, software: on the accompanying Web
    site http://vasarely.wiwi.hu-berlin.de/WebMining07/
  • Thanks to the Internet community!
  • You are invited to re-use these materials, but
    please give the proper credit.

3
Stages of knowledge discovery discussed in this
lecture
Application understanding
4
An addendum to the association rules: main
interestingness measures of association
rules (and a recommendation for postprocessing
the result set)
  • Support of a rule A → B
  • no. of instances with A and B / no. of all
    instances
  • Confidence of a rule A → B
  • no. of instances with A and B / no. of
    instances with A
  • support(A ∧ B) / support(A)
  • Lift of a rule A → B
  • support(A ∧ B) / ( support(A) · support(B) )
  • What does this measure, and in what numerical
    interval can it be?
  • Deleting redundant rules from the result set
  • If you have A → B and A ∧ C → B, the second rule
    is redundant.
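
The three measures and the redundancy rule above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the function names, the toy transactions and the representation of instances as sets of items are all assumptions made for the example.

```python
def support(itemset, transactions):
    """Fraction of all instances that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rule_measures(antecedent, consequent, transactions):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    a, b = frozenset(antecedent), frozenset(consequent)
    s_ab = support(a | b, transactions)
    s_a = support(a, transactions)
    s_b = support(b, transactions)
    return {
        "support": s_ab,
        "confidence": s_ab / s_a,      # assumes A occurs at least once
        "lift": s_ab / (s_a * s_b),    # assumes A and B each occur at least once
    }


def remove_redundant(rules):
    """Drop X -> B whenever some A -> B with A a proper subset of X is present."""
    rules = [(frozenset(a), frozenset(b)) for a, b in rules]
    return [
        (a, b) for a, b in rules
        if not any(b2 == b and a2 < a for a2, b2 in rules)
    ]


# Toy data (hypothetical): five instances over the items A, B, C.
transactions = [frozenset(t) for t in (
    {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B"}, {"C"},
)]
measures = rule_measures({"A"}, {"B"}, transactions)
kept = remove_redundant([({"A"}, {"B"}), ({"A", "C"}, {"B"})])
```

On the toy data, the rule A → B holds in 2 of 5 instances, and only A → B survives the redundancy filter.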

5
Agenda
A very short note on other uses of clustering
(e.g. in query mining)
Some observations on privacy ...
Best-practice design patterns with
open-source tools
6
  • Demonstration of WUM

7
The site
Business understanding / problem definition
How do users search in this online catalog?
Which search criteria are popular? Which are
efficient?
Berendt & Spiliopoulou, VLDB Journal 2000
8
The concept hierarchies / site ontology (excerpt)
SEITE1-...LI (1st page of a list) or SEITEn-...LI
(further page)
LA (Land)
SA (Schulart)
SU (Suche)
9
Sequence mining: one result pattern (successful
search for a school in Germany)
  • select t
  • from node a b, template a b as t
  • where a.url startswith "SEITE1-"
  • and a.occurrence = 1
  • and b.url contains "1SCHULE"
  • and b.occurrence = 1
  • and (b.support / a.support) > 0.2

a refinement
a repetition
a continuation
one example pattern
(Berendt & Spiliopoulou, VLDB J. 2000)
10
Sequences
11
Generalized sequences, navigation patterns, hits
in WUM
12
Aggregated Logs: The basic internal
representation in WUM
13
The confidence measure for generalized sequences
14
Templates in the query language MINT,
g-sequences, and navigation patterns
15
Interestingness measures: Support (hits) and
confidence
16
Aggregated Logs, queries, and query results
17
The basic idea of the WUM algorithm
18
MINT can express 3 types of constraints
(predicates)
19
The WUM gseqm algorithm
(B predicates)
20
Agenda
A very short note on other uses of clustering
(e.g. in query mining)
Some observations on privacy ...
Best-practice design patterns with
open-source tools
21
What makes people happy? A corpus-based
approach to finding happiness
22
Bayes' formula and its use for classification
  • 1. Joint probabilities and conditional
    probabilities: basics
  • P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
  • ⇒ P(A|B) = ( P(B|A) P(A) ) / P(B) (Bayes'
    formula)
  • P(A): prior probability of A (a hypothesis, e.g.
    that an object belongs to a certain class)
  • P(A|B): posterior probability of A (given the
    evidence B)
  • 2. Estimation
  • Estimate P(A) by the frequency of A in the
    training set (i.e., the number of A instances
    divided by the total number of instances)
  • Estimate P(B|A) by the frequency of B within the
    class-A instances (i.e., the number of A
    instances that have B divided by the total number
    of class-A instances)
  • 3. Decision rule for classifying an instance
  • If there are two possible hypotheses/classes (A
    and ¬A), choose the one that is more probable
    given the evidence
  • (¬A is "not A")
  • If P(A|B) > P(¬A|B), choose A
  • The denominators are equal ⇒ If ( P(B|A) P(A) )
    > ( P(B|¬A) P(¬A) ), choose A
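
Estimation and the decision rule can be tried out on a small training set. The following is a minimal sketch under stated assumptions: each training instance is a hypothetical pair of booleans (belongs to class A, shows evidence B), both classes occur in the data, and the function names are made up for the example.

```python
def estimate(instances):
    """Estimate P(A), P(B|A) and P(B|not A) from (is_a, has_b) pairs."""
    n = len(instances)
    n_a = sum(1 for is_a, _ in instances if is_a)
    n_not_a = n - n_a
    p_a = n_a / n
    p_b_given_a = sum(1 for is_a, has_b in instances if is_a and has_b) / n_a
    p_b_given_not_a = (
        sum(1 for is_a, has_b in instances if not is_a and has_b) / n_not_a
    )
    return p_a, p_b_given_a, p_b_given_not_a


def choose_a(p_a, p_b_given_a, p_b_given_not_a):
    """Decision rule for an instance with evidence B (denominators cancel)."""
    return p_b_given_a * p_a > p_b_given_not_a * (1 - p_a)


def posterior_a(p_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) by Bayes' formula, with P(B) expanded over A and not A."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b


# Hypothetical training set: 4 class-A instances (3 with B),
# 6 non-A instances (2 with B).
training = (
    [(True, True)] * 3 + [(True, False)]
    + [(False, True)] * 2 + [(False, False)] * 4
)
p_a, p_b_a, p_b_not_a = estimate(training)
decision = choose_a(p_a, p_b_a, p_b_not_a)
post = posterior_a(p_a, p_b_a, p_b_not_a)
```

Here P(B|A) P(A) = 0.75 · 0.4 = 0.3 and P(B|¬A) P(¬A) = (1/3) · 0.6 = 0.2, so the rule chooses A.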

23
Simplifications and Naive Bayes
  • 4. Simplify by setting the priors equal (i.e., by
    using as many instances of class A as of class
    ¬A)
  • ⇒ If P(B|A) > P(B|¬A), choose A
  • 5. More than one kind of evidence
  • General formula:
  • P(A | B1 ∧ B2) = P(A ∧ B1 ∧ B2) / P(B1 ∧ B2)
    = P(B1 ∧ B2 | A) P(A) / P(B1 ∧ B2)
    = P(B1 | B2 ∧ A) P(B2 | A) P(A) / P(B1 ∧ B2)
  • Enter the naive assumption: B1 and B2 are
    independent given A
  • ⇒ P(A | B1 ∧ B2) = P(B1|A) P(B2|A) P(A) /
    P(B1 ∧ B2)
  • By reasoning as in 3. and 4. above, the last two
    terms can be omitted
  • ⇒ If ( P(B1|A) P(B2|A) ) > ( P(B1|¬A) P(B2|¬A)
    ), choose A
  • The generalization to n kinds of evidence is
    straightforward.
  • These kinds of evidence are often called features
    in machine learning.

24
Example: Texts as bags of words
  • Common representations of texts
  • Set: can contain each element (word) at most once
  • Bag (aka multiset): can contain each word
    multiple times (most common representation used
    in text mining)
  • Hypotheses and evidence
  • A: The blog is a happy blog, the email is a spam
    email, etc.
  • ¬A: The blog is a sad blog, the email is a
    proper email, etc.
  • Bi refers to the i-th word occurring in the whole
    corpus of texts
  • Estimation for the bag-of-words representation
  • Example: estimation of P(B1|A)
  • number of occurrences of the first word in all
    happy blogs, divided by the total number of words
    in happy blogs (etc.)
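
The bag-of-words estimator and the equal-priors decision rule can be sketched as follows. Several things here are assumptions made for illustration and are not on the slide: the toy blog texts, whitespace tokenization, the function names, and the add-alpha smoothing (with `alpha=0` the estimate is exactly the ratio described above; smoothing only avoids zero probabilities for words unseen in one class).

```python
import math
from collections import Counter


def word_probabilities(documents, vocabulary, alpha=1.0):
    """Estimate P(word | class) from all documents of one class.

    With alpha=0 this is: occurrences of the word in the class,
    divided by the total number of words in the class.
    """
    counts = Counter(w for doc in documents for w in doc.split())
    total = sum(counts[w] for w in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}


def classify(text, p_happy, p_sad):
    """Equal priors: compare the products of P(word | class), in log space."""
    score_happy = score_sad = 0.0
    for w in text.split():
        if w in p_happy:  # words outside the training vocabulary are ignored
            score_happy += math.log(p_happy[w])
            score_sad += math.log(p_sad[w])
    return "happy" if score_happy > score_sad else "sad"


# Hypothetical mini-corpus.
happy_docs = ["great sunny day", "great fun with friends"]
sad_docs = ["rainy day alone", "alone and tired"]
vocab = {w for d in happy_docs + sad_docs for w in d.split()}

p_happy = word_probabilities(happy_docs, vocab)
p_sad = word_probabilities(sad_docs, vocab)
label = classify("great day", p_happy, p_sad)
```

Summing logarithms instead of multiplying probabilities is a common design choice for long texts, since a product of many small numbers can underflow.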

25
WEKA: NaiveBayes and NaiveBayesMultinomial
  • The WEKA classifier learning scheme
    NaiveBayesMultinomial implements this model of
    the probability that a word occurs in a document
    given that the document is in that class.
  • Its output is a table giving these probabilities
  • The WEKA classifier learning scheme NaiveBayes
    assumes that the attributes are normally
    distributed.
  • Needed when the attributes are numerical and not
    necessarily 0/1
  • Its output describes the parameters of these
    normal distributions
  • Explanation of the annotations of the attributes:
  • http://grb.mnsu.edu/grbts/doc/manual/Naive_Bayes.html
  • Explanation of the error measures:
  • http://grb.mnsu.edu/grbts/doc/manual/Error_Measurements.htmlsecerror

26
The happiness factor of Mihalcea & Liu (2006)
  • Starting with the features identified as
    important by the Naïve Bayes classifier (a
    threshold of 0.3 was used in the feature
    selection process), we selected all those
    features that had a total corpus frequency higher
    than 150, and consequently calculate the
    happiness factor of a word as the ratio between
    the number of occurrences in the happy blogposts
    and the total frequency in the corpus.
  • ⇒ What is the relation to the Naïve Bayes
    estimators?
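
The ratio described in the quotation can be computed directly from word counts. This is a minimal sketch, not the authors' code: it skips the preceding Naïve Bayes feature-selection step, tokenizes on whitespace, and uses hypothetical names and toy posts. The frequency threshold is a parameter so that the toy example can use a small value instead of 150.

```python
from collections import Counter


def happiness_factors(happy_posts, sad_posts, min_corpus_freq=150):
    """Ratio of a word's occurrences in happy posts to its total corpus frequency.

    Only words with a total corpus frequency higher than `min_corpus_freq`
    are kept. The Naive Bayes feature-selection step is NOT reproduced here.
    """
    happy = Counter(w for post in happy_posts for w in post.lower().split())
    sad = Counter(w for post in sad_posts for w in post.lower().split())
    factors = {}
    for word in set(happy) | set(sad):
        total = happy[word] + sad[word]
        if total > min_corpus_freq:
            factors[word] = happy[word] / total
    return factors


# Hypothetical toy corpus; threshold lowered to 1 for illustration.
happy_posts = ["yay shopping", "yay yay"]
sad_posts = ["goodbye yay", "goodbye goodbye"]
factors = happiness_factors(happy_posts, sad_posts, min_corpus_freq=1)
```

In the toy corpus "yay" occurs 3 times in happy posts and 4 times overall, so its factor is 0.75; "shopping" occurs only once and is filtered out.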

27
Agenda
A very short note on other uses of clustering
(e.g. in query mining)
Some observations on privacy ...
Best-practice design patterns with
open-source tools
28
Clustering by information contained in the
objects to be clustered (here: documents contain
text) www.kartoo.com
29
The basic idea of clustering: group similar things
[Figure: objects plotted by Attribute 1 and Attribute 2, forming Group 1 and Group 2]
Based on http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt
30
Idea and Applications
  • Clustering is the process of grouping a set of
    physical or abstract objects into classes of
    similar objects.
  • It is also called unsupervised learning.
  • It is a common and important task that finds many
    applications.
  • Applications in text analysis / Web content
    mining, e.g. for search engines
  • Structuring search results
  • Suggesting related pages
  • Automatic directory construction/update
  • Finding near identical/duplicate pages
  • Applications in Web usage mining
  • Customer/user segmentation
  • User segmentation for recommender systems /
    personalization

Based on http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt
31
Concepts in Clustering
  • Defining distance between points
  • Cosine distance (which you already know)
  • Overlap distance
  • A good clustering is one where
  • (Intra-cluster distance) the sum of distances
    between objects in the same cluster is
    minimized,
  • (Inter-cluster distance) while the distances
    between different clusters are maximized
  • Objective: to minimize F(Intra, Inter)
  • Clusters can be evaluated with internal as well
    as external measures
  • Internal measures are related to the inter/intra
    cluster distance
  • External measures are related to how
    representative the current clusters are of the
    true classes
  • See entropy and F-measure

Based on http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt
32
Inter/Intra Cluster Distances
  • Intra-cluster distance
  • (Sum/Min/Max/Avg) the (absolute/squared) distance
    between
  • All pairs of points in the cluster OR
  • Between the centroid and all points in the
    cluster OR
  • Between the medoid and all points in the
    cluster
  • Inter-cluster distance
  • Sum the (squared) distance between all pairs of
    clusters
  • Where distance between two clusters is defined
    as
  • distance between their centroids/medoids
  • (Spherical clusters)
  • Distance between the closest pair of points
    belonging to the clusters
  • (Chain shaped clusters)
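
A few of the variants listed above can be written down compactly. The sketch below picks one option from each list (sum of squared distances to the centroid for the intra-cluster distance; centroid distance and closest-pair distance for the inter-cluster distance). The function names and the toy points are assumptions made for the example, and Euclidean distance is assumed throughout.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def intra_cluster(points):
    """Sum of squared distances between the centroid and all points."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points)


def inter_centroid(cluster1, cluster2):
    """Distance between the centroids (suits spherical clusters)."""
    return math.dist(centroid(cluster1), centroid(cluster2))


def inter_closest_pair(cluster1, cluster2):
    """Distance between the closest pair of points (suits chain-shaped clusters)."""
    return min(math.dist(p, q) for p in cluster1 for q in cluster2)


# Hypothetical 2D clusters.
group1 = [(0.0, 0.0), (0.0, 2.0)]
group2 = [(3.0, 0.0), (5.0, 2.0)]

intra1 = intra_cluster(group1)
between_centroids = inter_centroid(group1, group2)
between_closest = inter_closest_pair(group1, group2)
```

For these points the two inter-cluster definitions give different values (4 between the centroids, 3 between the closest pair), which is why the choice matters for elongated clusters.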

From http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt
33
How hard is clustering?
  • One idea is to consider all possible clusterings,
    and pick the one that has best inter and intra
    cluster distance properties
  • Suppose we are given n points, and would like to
    cluster them into k clusters
  • How many possible clusterings?
  • Too hard to do it by brute force or optimally
  • Solution: iterative optimization algorithms
  • Start with a clustering, iteratively improve it
    (e.g. K-means)
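
To get a feeling for the size of the search space: if a clustering is taken to be a partition of the n points into k non-empty, unlabelled clusters (an assumption about what counts as a distinct clustering), the number of clusterings is the Stirling number of the second kind S(n, k). A small sketch using the standard recurrence:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n points into k non-empty, unlabelled clusters."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # Point n either joins one of the k existing clusters
    # or forms a new cluster on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


small = stirling2(4, 2)
medium = stirling2(10, 3)
```

Already 10 points can be split into 3 clusters in 9330 ways. For fixed k the count grows roughly like k^n / k!, which is why exhaustive search is not feasible.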

From http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt
34
Classical clustering methods
  • Partitioning methods
  • k-Means (and EM), k-Medoids
  • Hierarchical methods
  • agglomerative, divisive, BIRCH
  • Model-based clustering methods

From http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt
35
K-means
  • Works when we know k, the number of clusters we
    want to find
  • Idea
  • Randomly pick k points as the centroids of the
    k clusters
  • Loop
  • For each point, put the point in the cluster to
    whose centroid it is closest
  • Recompute the cluster centroids
  • Repeat loop (until there is no change in clusters
    between two consecutive iterations.)

Iterative improvement of the objective function:
sum of the squared distances from each point to
the centroid of its cluster
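
The loop described on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions: points are tuples of numbers, distance is Euclidean, the initial centroids are k randomly chosen data points, an empty cluster keeps its old centroid, and the names `k_means` and `objective` as well as the `seed` and `max_iter` parameters are made up for the example.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Return (assignment, centroids) after iterating until assignments are stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # randomly pick k points as centroids
    assignment = None
    for _ in range(max_iter):
        # Put each point in the cluster whose centroid is closest.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:  # no change between two iterations
            break
        assignment = new_assignment
        # Recompute the cluster centroids.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids


def objective(points, assignment, centroids):
    """Sum of squared distances from each point to the centroid of its cluster."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))


# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, k=2)
cost = objective(points, labels, centroids)
```

On well-separated data like this the result does not depend on the starting points. In general K-means only finds a local optimum of the objective, so running it with several seeds and keeping the lowest-cost result is a common precaution.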
From http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt
36
K Means Example (K=2). For a more complex
simulation, see
http://www.cs.tu-bs.de/rob/lehre/bv/Kmeans/Kmeans.html
Reassign clusters
Converged!
Based on http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt
37
A map of documents, grouped by their topics
38
DocumentAtlas: A two-step procedure
  • Latent semantic indexing: Project documents into
    a semantic space (dimensionality reduction and
    identification of commonalities even if
    vocabulary is different)
  • Multidimensional scaling: Project that space into
    2D, preserving the distances as well as possible
  • Input: a set of documents
  • Output: a document map

39
Agenda
A very short note on other uses of clustering
(e.g. in query mining)
Some observations on privacy ...
Best-practice design patterns with
open-source tools
40
Clustering by information contained in the
objects to be clustered (here: documents contain
text) www.kartoo.com
41
Clustering by information associated with the
objects to be clustered (here: photos are
associated with tags) www.flickr.com
42
Clustering by information associated with the
objects to be clustered (here: queries are
associated with document texts) (1)
43
Clustering by information associated with the
objects to be clustered ... (2) Baeza-Yates,
Query Mining, ECIR 2005
  1. Create instances of past (query, result set)
    combinations
  2. Cluster them by the textual similarity of the
    (viewed) result documents
  3. Use this to recommend a better / an additional
    query

New user
44
Ranking by similarity and popularity: Examples
45
Agenda
A very short note on other uses of clustering
(e.g. in query mining)
Some observations on privacy ...
Best-practice design patterns with
open-source tools
46
Internet users are worried about their privacy ...
  • (results from a meta-study of 30
    questionnaire-based studies [TK03])

47
... but are they really? An online shop with a
difference
Berendt, Günther, Spiekermann, Communications
of the ACM, 2005
48
Privacy-related behaviour
Shopping for cameras
Shopping for jackets
Berendt, Data Mining and Knowledge Discovery,
2002, Berendt, Postproc. WebKDD 2002
49
Finding: People are willing to exchange privacy
for personalization benefits
  • Users would provide, in return for personalized
    content, information on their name (88%),
    education (88%), age (86%), hobbies (83%), salary
    (59%), or credit card number (13%).
  • 27% of Internet users think tracking allows the
    site to provide information tailored to specific
    users.
  • 73% of online users find it useful if a site
    remembers basic information such as name and
    address.
  • People are willing to give information to receive
    a personalized online experience: 51% or 40%,
    depending on the study.
  • [TK03]

50
User-centric evaluation: An experimental
investigation of the effect of explaining the
personalization-privacy tradeoff
  • [KT05] compared the effects of traditional
    privacy statements with those of a contextualized
    explanation on users' willingness to answer
    questions about themselves and their (product)
    preferences.
  • In the contextualized-explanation condition,
    participants
  • answered 8.3% more questions (gave at least one
    answer) (p < 0.001),
  • gave 19.6% more answers (p < 0.001),
  • purchased 33% more often (p < 0.07),
  • stated that their data had helped the Web store
    to select better books (p < 0.035), even though
    the recommendations were static and identical for
    both groups.

(screenshot from Teltzrow, M. & Kobsa, A. (2004).
Communication of Privacy and Personalization in
E-Business. In Proceedings of the Workshop
WHOLES: A Multiple View of Individual Privacy in
a Networked World, Stockholm, Sweden.)
51
But what is privacy? Is it only about data
protection?
  • Phillips, D.J. 2004. Privacy Policy and PETs:
    The Influence of Policy Regimes on the
    Development and Social Implications of Privacy
    Enhancing Technologies. New Media & Society
    6(6): 691-706
  • freedom from intrusion
  • construction of the public/private divide
  • separation of identities
  • protection from surveillance (the right to choose
    belonging)

52
Also: whose privacy? Stakeholders and privacy
interests: a (partially) fictitious example
  • users of the system
  • passengers
  • system administrators
  • Other stakeholders
  • airport administration
  • airport security
  • airlines
  • duty-free shop

53
Different privacy interests of the different
stakeholders
54
Agenda
A very short note on other uses of clustering
(e.g. in query mining)
Some observations on privacy ...
Best-practice design patterns with
open-source tools
55
In the preparation of a log file (recommendations
for open-source tools are shown in green)
  1. Use qualitative methods for application
    understanding (read!)
  2. Inspect the site and the URLs for data
    understanding
  3. Generate Analog reports for getting base
    statistics of usage
  4. Build concept system / hierarchy and mapping
    URLs → concepts (notation: WUMprep regex)
  5. Use WUMprep for data preparation
  6. Remove unwanted entries (pictures etc.)
  7. Sessionize
  8. Remove robots
  9. Replace URLs by concepts
  10. (Build a database)
  11. Use WEKA for modelling
  12. Transform log file into ARFF (WUMprep4WEKA)
  13. Cluster, classify, find association rules, ...
  14. Use WUM for modelling
  15. Select patterns based on objective
    interestingness measures (support, confidence,
    lift, ...) and on subjective interestingness
    measures (unexpected? Application-relevant?)
  16. Present results in tabular, textual and graphical
    form (use Excel, ...)
  17. Interpret the results
  18. Make recommendations for site improvement etc.
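
The data-preparation steps in this list (remove unwanted entries, remove robots, sessionize, replace URLs by concepts) can be illustrated with a small stand-alone sketch. It does not use or imitate the WUMprep interface. Everything in it is an assumption made for the example: the log-entry fields, the regular expressions, the URL-to-concept mapping, the identification of a user by host plus user agent, and the 30-minute inactivity timeout for sessions. For simplicity, robots are filtered by user-agent string before sessionizing.

```python
import re

# All patterns below are illustrative assumptions, not WUMprep settings.
UNWANTED_RE = re.compile(r"\.(gif|jpe?g|png|css|js)$", re.IGNORECASE)
ROBOT_RE = re.compile(r"bot|crawler|spider", re.IGNORECASE)
CONCEPTS = [  # hypothetical mapping URLs -> concepts
    (re.compile(r"^/search"), "SEARCH"),
    (re.compile(r"^/list"), "LIST"),
]
SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a session


def to_concept(url):
    """Replace a URL by the first matching concept, or OTHER."""
    for pattern, concept in CONCEPTS:
        if pattern.search(url):
            return concept
    return "OTHER"


def prepare(entries):
    """Turn log entries (dicts with host, agent, time, url) into concept sessions."""
    kept = [
        e for e in entries
        if not UNWANTED_RE.search(e["url"]) and not ROBOT_RE.search(e["agent"])
    ]
    kept.sort(key=lambda e: (e["host"], e["agent"], e["time"]))
    sessions, last = [], None
    for e in kept:
        same_user = (
            last is not None
            and (last["host"], last["agent"]) == (e["host"], e["agent"])
        )
        if not same_user or e["time"] - last["time"] > SESSION_TIMEOUT:
            sessions.append([])  # start a new session
        sessions[-1].append(to_concept(e["url"]))
        last = e
    return sessions


# Hypothetical log excerpt (time in seconds).
log = [
    {"host": "h1", "agent": "Mozilla", "time": 0, "url": "/search?q=x"},
    {"host": "h1", "agent": "Mozilla", "time": 10, "url": "/logo.gif"},
    {"host": "h1", "agent": "Mozilla", "time": 60, "url": "/list/1"},
    {"host": "h1", "agent": "Mozilla", "time": 4000, "url": "/search"},
    {"host": "h2", "agent": "Googlebot", "time": 5, "url": "/list/2"},
]
sessions = prepare(log)
```

The resulting sessions are lists of concept names, which is the kind of input the modelling steps (WEKA, WUM) then work on after conversion into their own formats.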

56
In the case study
  1. Use qualitative methods for application
    understanding (read!)
  2. Inspect the site and the URLs for data
    understanding
  3. Generate Analog reports for getting base
    statistics of usage
  4. Build concept system / hierarchy and mapping
    URLs → concepts (notation: WUMprep regex)
  5. Use WUMprep for data preparation
  6. Remove unwanted entries (pictures etc.)
  7. Sessionize
  8. Remove robots
  9. Replace URLs by concepts
  10. (Build a database)
  11. Use WEKA for modelling
  12. Transform log file into ARFF (WUMprep4WEKA)
  13. Cluster, classify, find association rules, ...
  14. Use WUM for modelling
  15. Select patterns based on objective
    interestingness measures (support, confidence,
    lift, ...) and on subjective interestingness
    measures (unexpected? Application-relevant?)
  16. Present results in tabular, textual and graphical
    form (use Excel, ...)
  17. Interpret the results
  18. Make recommendations for site improvement etc.

57
The preparation of texts (e.g., for an automatic
version of step 2.2.)
  • Is quite involved when done properly
  • (a good introduction to preprocessing for text
    mining can be found in
  • Grobelnik, M., Mladenic, D.: Text Mining
    Tutorial.
  • http://eprints.pascal-network.org/archive/00000017/01/Tutorial_Marko.pdf )
  • However, as a first step, you can also use the
    raw text of documents (generated with only a few
    of the tools in the TextGarden library).

58
Thank you!