1
The Elements of Statistical Learning, Hastie & co.
Chapter 14 UNSUPERVISED LEARNING - Episode 1 of 3
1 INTRODUCTION
2 ASSOCIATION RULES
2
14.1 Introduction
  • Supervised
  • Output Y = (Y1,…,Ym)
  • Predictor X = (X1,…,Xp)
  • Loss function L(y, ŷ)
  • (X,Y) r.v. with joint probability density P(X,Y)
  • Density estimation problem
  • Interest in properties of P(Y|X)
  • (location properties)
  • Ex: µ(x) = argmin_θ E_{Y|X} L(Y, θ)
  • Simple since m = 1 (often)
  • Unsupervised
  • Random p-vector X = (X1,…,Xp)
  • having probability density P(X)
  • Interest in properties of P(X)
  • (not just location properties)
  • If p ≤ 3, effective nonparametric methods exist for direct estimation
    of P(X), but they fail in high dimension
  • p_unsupervised ≫ p_supervised
  • Complications

3
Unsupervised learning
  • - Association rules: construct simple descriptions that describe
    regions of high density; special case of high-dimensional,
    binary-valued data
  • Episode 1: TODAY
  • - Cluster analysis: find multiple convex regions that contain modes
    of P(X)
  • mixture modeling
  • Episode 2: Next Friday
  • - Principal components (PC), multidimensional scaling, self-organizing
    maps, principal curves: identify low-dimensional manifolds that
    represent high data density
  • ⇒ Info about associations between variables
  • Episode 3: Last Friday
  • Measure of success: heuristic

4
14.2 Association rules - Agenda
2.0 Introduction
2.1 Example: market basket analysis
2.2 The APRIORI algorithm
2.3 Demographic example (1)
2.4 Unsupervised as supervised learning
2.5 Generalized association rules
2.6 Choice of supervised learning method
2.7 Demographic example (2)
5
2.0 Introduction
  • - SOURCE: mining commercial databases (DB)
  • - GOAL: find joint values of the variables X = (X1,…,Xp) appearing
    most frequently in the DB
  • - APPLY on binary data Xj ∈ {0,1}: "market basket analysis"
  • Observations: sales transactions
  • Variables: all of the items sold in the store
  • For observation i, each variable Xj:
  • xij = 1 if the jth item is purchased, xij = 0 otherwise.

6
2.0 Introduction
- BASIC GOAL: find a collection of prototype X-values v1,…,vL s.t. P(vl) is
relatively large. The natural estimator of P(vl) is the relative frequency,
BUT in high dimension X = vl is rare ⇒ no reliable estimation.
- 1st simplification: seek regions of the X-space with high probability
content (rather than values x where P(x) is large).
Notation: Sj = all possible values of Xj, sj ⊆ Sj.
Goal: find s1,…,sp s.t.
P[ ⋂_{j=1}^{p} (Xj ∈ sj) ]
is relatively large.
Conjunctive rule
7
2.1 Market Basket Analysis
- Further simplification: 2 types of subsets, either sj = {v0j} (a single
value) or sj = Sj (the whole set).
Reformulation: find a subset of integers J ⊆ {1,…,p} and values v0j, j ∈ J,
s.t.
P[ ⋂_{j∈J} (Xj = v0j) ]
is relatively large.
8
2.1 Market Basket Analysis
- Dummy variables: the Sj ⇒ new binary variables Z1,…,ZK, one for each value
of each Xj, where Zk = 1 if the corresponding Xj takes the corresponding
value, 0 otherwise.
Re-reformulation: find a subset of integers K ⊆ {1,…,K} s.t.
P[ ⋂_{k∈K} (Zk = 1) ]
is relatively large. K = item set.
Estimation: the prevalence (support) of K:
T(K) = (1/N) Σ_{i=1}^{N} ∏_{k∈K} z_ik
In association rule mining a threshold t is specified ⇒ seeking {Kl | T(Kl) > t}.
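As a concrete reading of the support formula above, here is a minimal Python
sketch (mine, not from the slides): T(K) is just the fraction of rows of the
dummy matrix whose entries are 1 for every item in K.

def support(Z, K):
    """Fraction of the N transactions containing every item in the set K."""
    N = len(Z)
    return sum(all(row[k] == 1 for k in K) for row in Z) / N

# Example: N = 4 transactions over 3 items (rows of dummy variables z_ik).
Z = [[1, 1, 0],
     [1, 0, 1],
     [1, 1, 1],
     [0, 1, 0]]
print(support(Z, {0, 1}))  # items 0 and 1 co-occur in 2 of 4 rows -> 0.5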
9
2.2 The Apriori algorithm (Agrawal R.)
- Solves the previous problem with a small number of passes over the data
(very large DB).
- Relies on:
1) |{K | T(K) > t}| is small
2) L ⊆ K ⇒ T(L) ≥ T(K)
1st pass: compute the support of all single-item sets; those whose support is
< t are discarded.
2nd pass: compute the support of all item sets of size 2; those whose support
is < t are discarded.
⇒ to generate all frequent item sets with |K| = m, we need to consider only
candidates s.t. all of their m ancestral item sets of size m-1 are frequent.
⇒ better on an example (and see the sketch below)!
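Here is an illustrative Python sketch of those passes (my own simplification,
not Agrawal's implementation): each level keeps only item sets above the
support-count threshold, and a size-m candidate is generated only if all of
its size-(m-1) subsets survived the previous pass.

from itertools import combinations

def apriori(transactions, t):
    """All item sets whose support count exceeds t (transactions: list of sets)."""
    items = {i for T in transactions for i in T}
    # 1st pass: frequent single-item sets.
    Lk = {frozenset([i]) for i in items
          if sum(i in T for T in transactions) > t}
    frequent, m = set(Lk), 2
    while Lk:
        # Candidates of size m...
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == m}
        # ...kept only if every (m-1)-subset is frequent (the ancestral property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, m - 1))}
        # One pass over the data to count support of the surviving candidates.
        Lk = {c for c in candidates if sum(c <= T for T in transactions) > t}
        frequent |= Lk
        m += 1
    return frequent

D = [{'A', 'B', 'C'}, {'A', 'C'}, {'A', 'D'}, {'B', 'E', 'F'}]
print(apriori(D, t=1))  # {A}, {B}, {C} and {A, C}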
10
Example 1 from Agrawal 1999 (ULg), t = 2
[Tables on the slide trace the passes: candidate item sets and the frequent
sets L1 after the 1st pass, then L2 after the 2nd pass, then L3.]
11
Association rules (Agrawal)
I: set of items. T: transaction, a set of items s.t. T ⊆ I. D: database, a
set of transactions.
An association rule is an implication of the form X ⇒ Y where X, Y ⊂ I.
The rule X ⇒ Y holds in D with confidence c if c% of the transactions in D
that contain X also contain Y.
The rule X ⇒ Y has support s in D if s% of the transactions in D contain
X ∪ Y.
GOAL: find all rules that have support and confidence greater than a
specified minimum support and minimum confidence.
The Apriori algorithm returns the minimum-support (frequent) item sets ∪ Lk,
from which the rules are read off (see the sketch below).
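Reading rules off the frequent item sets can be sketched as follows (an
illustration under my own naming, not Agrawal's code): split each frequent
set K into antecedent X and consequent K \ X, and keep the rule when
confidence = T(K)/T(X) reaches the minimum c.

from itertools import combinations

def rules(supports, c):
    """supports: dict frozenset -> support T(K); returns rules X => Y with confidence >= c."""
    out = []
    for K, T_K in supports.items():
        for r in range(1, len(K)):                    # singletons yield no rules
            for X in map(frozenset, combinations(K, r)):
                conf = T_K / supports[X]              # C(X => Y) = T(X u Y) / T(X)
                if conf >= c:
                    out.append((set(X), set(K - X), conf))
    return out

supports = {frozenset('A'): 0.75, frozenset('C'): 0.5, frozenset('AC'): 0.5}
print(rules(supports, c=0.6))  # A => C (conf 0.667) and C => A (conf 1.0)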
12
Example 2 from Agrawal
For the rule A ⇒ C:
- Support = support(A,C) = T(A ∪ C) = 50%
- Confidence = C(A ⇒ C) = support(A,C)/support(A) = T(A ∪ C)/T(A)
  = 50/75 = 66.6%
- Lift = L(A ⇒ C) = C(A ⇒ C)/T(C)
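Plugging the slide's numbers into those definitions (only T(A) and T(A ∪ C)
appear on the slide, so the lift value is left symbolic):

T_A, T_AC = 0.75, 0.50            # support of {A} and of {A, C}

support_rule = T_AC               # support(A => C) = T(A u C) = 0.50
confidence = T_AC / T_A           # 0.50 / 0.75 = 0.666...
print(support_rule, round(confidence, 3))

def lift(confidence, T_C):
    """L(A => C) = C(A => C) / T(C); needs the support of C, not shown here."""
    return confidence / T_C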
13
Output
{A ⇒ B s.t. T(A ⇒ B) > t and C(A ⇒ B) > c}
Problem: rules with high confidence or lift but low support are missed, e.g.
Vodka ⇒ caviar.
14
2.3 Demographic example
N = 9409
15
2.3 Demographic example
Results of the algorithm: 6288 association rules involving ≤ 5 predictors,
with support > 10%.
3 of the association rules are shown (p. 446).
16
2.4 Unsupervised as supervised learning
Technique for transforming the density estimation problem into one of
supervised function approximation.
g(x): unknown data probability density (to be estimated).
g0(x): specified probability density function used for reference
(e.g. uniform).
Dataset x1,…,xN: iid random sample from g(x).
- Generate a sample of size N0 from g0(x).
- Pooling the samples and assigning mass w = N0/(N + N0) to the points from
g(x) and w0 = N/(N + N0) to the points from g0(x)
⇒ results in a random sample drawn from the mixture density (g(x) + g0(x))/2.
17
Assign Y = 1 to sample points from g(x) and Y = 0 to sample points from
g0(x). Then
µ(x) = E(Y | x) = g(x) / (g(x) + g0(x))
can be estimated by supervised learning, using the combined sample
(y1, x1),…,(y_{N+N0}, x_{N+N0})
⇒ ĝ(x) = g0(x) · µ̂(x) / (1 − µ̂(x))
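A minimal end-to-end sketch of this trick (my illustration, assuming NumPy
and scikit-learn; none of it is from the book): pool the data with a uniform
reference sample, label the two sources Y = 1 / Y = 0, fit a classifier for
µ(x), and invert to get ĝ(x).

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(1000, 1))        # sample from the unknown g(x)
lo, hi = X.min(), X.max()
X0 = rng.uniform(lo, hi, size=(1000, 1))        # N0 = N reference points from g0
g0 = 1.0 / (hi - lo)                            # uniform reference density

XX = np.vstack([X, X0])
y = np.r_[np.ones(len(X)), np.zeros(len(X0))]   # Y = 1 for g, Y = 0 for g0

feats = lambda Z: np.hstack([Z, Z ** 2])        # log g/g0 is quadratic for a Gaussian g
clf = LogisticRegression().fit(feats(XX), y)    # estimates mu(x) = E(Y | x)

grid = np.linspace(lo, hi, 5).reshape(-1, 1)
mu = clf.predict_proba(feats(grid))[:, 1]
g_hat = g0 * mu / (1 - mu)                      # g(x) = g0(x) mu(x) / (1 - mu(x))
print(np.c_[grid, g_hat])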
18
A generalized version of logistic regression is well suited for this
application, since the log-odds are directly estimated:
f(x) = log[ g(x) / g0(x) ]
and ĝ(x) = g0(x) · e^{f̂(x)}
19
The choice of g0(x) depends on g(x), on the procedure used to estimate µ(x),
and on the goal:
- If accuracy: choose g0(x) s.t. estimation is easy.
- If departures from uniformity: g0(x) = uniform density over the variables'
range.
- If departures from independence: g0(x) = product of the marginal densities.
Rem: N0 ≫ N ⇒ computation and memory problems.
20
2.5 Generalized Association Rules
General problem: find s1,…,sp s.t.
P[ ⋂_{j=1}^{p} (Xj ∈ sj) ]
is large. Here using supervised learning, on moderately sized datasets.
Re-re-reformulation: find a subset of integers J ⊆ {1,…,p} and corresponding
subsets sj s.t.
P[ ⋂_{j∈J} (Xj ∈ sj) ]
is large ⇒ {(Xj ∈ sj)}_{j∈J} = generalized item set.
Heuristic search.
21
Market basket analysis and the generalized formulation reference the uniform
distribution ⇒ seeking item sets that are more frequent than would be
expected if all joint values (x1,…,xN) were uniformly distributed.
This favors the discovery of item sets whose marginal constituents
(Xj ∈ sj) are individually frequent, i.e. whose
P(Xj ∈ sj)
is large. Conjunctions of such frequent subsets will tend to appear more
often among item sets of high support than conjunctions of marginally less
frequent subsets.
This is why the rule "Vodka ⇒ Caviar" is not discovered.
22
Referencing the uniform distribution can cause highly frequent item sets WITH
low association among their constituents to dominate the collection of
highest-support item sets.
⇒ Using the product of the variable marginal densities,
g0(x) = ∏_{j=1}^{p} gj(xj),
as reference distribution removes the preference for highly frequent values
of individual variables in the discovered item sets.
⇒ Rules like "Vodka ⇒ Caviar" can emerge!
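A standard way to draw a reference sample from this product of marginals is
to apply an independent random permutation to each variable's data values.
A small NumPy sketch (my illustration):

import numpy as np

def reference_sample(X, rng):
    """Permute each column independently: associations are destroyed, but
    every marginal distribution is kept intact, i.e. a draw from prod_j g_j(x_j)."""
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=500)
X0 = reference_sample(X, rng)
print(np.corrcoef(X.T)[0, 1], np.corrcoef(X0.T)[0, 1])  # ~0.9 vs ~0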
23
SUMMARY
1. Choice of the reference distribution.
2. Drawing the reference sample.
3. Supervised learning problem with output Y ∈ {0,1}.
GOAL: use the training data to find regions
R = ⋂_{j∈J} (Xj ∈ sj)
s.t. the target function µ(x) = E(Y | x) is large, and also the data support
of R is not too small.
24
2.6 Choice of the supervised method
The solution regions R are defined by conjunctive rules ⇒ supervised methods
that learn such rules would be appropriate: CART decision trees and PRIM.
25
2.7 Demographic example
Illustration of the PRIM method on the same data set as the APRIORI
algorithm. 3 of the high-support item sets are shown (p. 452), together with
the generalized association rules derived from these item sets with
confidence ≥ 95% (3 of the 4 presented).
26
(No Transcript)
27
APRIORI algorithm: exhaustive ⇒ finds ALL rules with support greater than t
(works on dummy variables).
PRIM or CART: greedy ⇒ no guaranteed optimal set of rules.
28
Conclusion - Remark
APRIORI algorithm: a basic presentation (improvements and generalizations
are available). Too much text!!!
See you next week for Cluster Analysis.