Title: Project Presentations
1Project Presentations
- Thursday this week, each student will make a
4-minute presentation on their project in class
(with 1 or 2 minutes for questions) - Email me your Powerpoint or PDF slides, with your
name (e.g., joesmith.ppt), before 10am next
Thursday - Suggested content
- Definition of the task/goal
- Description of data sets
- Description of algorithms
- Experimental results and conclusions
- Be visual where possible! (i.e., use figures,
graphs, etc)
2Final Project Reports
- Must be submitted as an email attachment (PDF,
Word, etc) by - 12 noon Tuesday next week
- Use ICS 278 final project report in the subject
line of your email - Report should be self-contained
- Like a short technical paper
- A reader should be able to repeat your results
- Include details in appendices if necessary
- Approximately 1 page of text per section (see
next slide) - graphs/plots dont count include as many of
these as you like. - Can re-use material from proposal and from
midterm progress report if you wish
3Suggested Outline of Final Project Report
- Introduction
- Clear description of task/goals of the project
- Motivation why is this problem interesting
and/or important? - Discussion of relevant literature
- Summarize relevant aspects of prior
published/related work - Technical approach
- Data used in your project
- Exploratory data analysis relevant to your task
- Include as many of plots/graphs as you think are
useful/relevant - Algorithms used in your project
- Clear description of all algorithms used
- Credit appropriate sources if you used other
implementations -
- Experimental Results
- Clear description of your experimental
methodology - Detailed description of your results (graphs,
tables, etc)
4ICS 278 Data MiningLecture 19 Pattern
Discovery Algorithms
- Padhraic Smyth
- Department of Information and Computer Science
- University of California, Irvine
5Pattern-Based Algorithms
- Global predictive and descriptive modeling
- global models in the sense that they cover
all of the data space - Patterns
- More local structure, only describe certain
aspects of the data - Examples
- A single small very dense cluster in input space
- e.g., a new type of galaxy in astronomy data
- An unusual set of outliers
- e.g., indications of an anomalous event in
time-series climate data - Associations or rules
- If bread is purchased and wine is purchased then
cheese is purchased with probability p - Motif-finding in sequences, e.g.,
- motifs in DNA sequences noisy words in random
background
6General Ideas for Patterns
- Many patterns can be described in the general
form - if condition 1 then condition 2 (with some
certainty) - Probabilistic rules If Age 40 and
education college then income 50k with
probability p - Bumps
- If Age 40 and education college then
mean income 73k - if antecedent then consequent
- if j then v
- where j is generally some box in the input
space - where v is a statement about a variable of
interest, e.g., p(y j ) or E y j - Pattern support
- Support p( j ) or p(j , w )
- Fraction of points in input space where the
condition applies - Often interested in patterns with larger support
7How Interesting is a Pattern?
- Note interestingness is inherently subjective
- Depends on what the data analyst already knows
- Difficult to quantify prior knowledge
- How interesting a pattern is, can be a function
of - How surprising it is relative to prior knowledge?
- How useful (actionable) it is?
- This is a somewhat open research problem
- In general pattern interestingness is difficult
to quantify - Use simple surrogate measures in practice
8How Interesting is a Pattern?
- Interestingness of a pattern
- Measures how interesting the pattern j - v is
- Typical measures of interest
- Conditional probability p( v j )
-
- Change in probability p( v j ) - p( v
) - Lift p( v j ) / p( v ) (also log
of this) - Change in mean target response, e.g., E y j
/Ey
9Pattern-Finding Algorithms
- Typically search a data set for the set of
patterns that maximize some score function - Usually a function of both support and
interestingness - E.g.,
- Association rules
- Bump-hunting
- Issues
- Huge combinatorial search space
- How many patterns to return to the user
- How to avoid problems with redundant patterns
- Statistical issues
- Even in random noise, if we search over a very
large number of patterns, we are likely to find
something that looks significant - This is known as multiple hypothesis testing in
statistics - One approach that can help is to conduct
randomization tests - e.g., for matrix data randomly permute the values
in each column - Run pattern-discovery algorithm resulting
scores provide a null distribution - Ideally, also need a 2nd data set to validate
patterns
10Generic Pattern Finding
Find patterns
Task
Representation
pattern language
f(support, interestingness)
Score Function
Search/Optimization
greedy, branch-and-bound
Data Management
varies
Models, Parameters
list of K highest scoring patterns
11Two Pattern Finding Algorithms
- Bump-hunting the PRIM algorithm
- Bump Hunting in High Dimensional Data
- J. H. Friedman N. I. Fisher
- Statistics and Computing, 2000
- Market basket data association rule algorithms
12Bump-Hunting (PRIM) algorithm
- Patient Rule Induction Method (PRIM)
- Friedman and Fisher, 2000
- Addresses bump-hunting problem
- Assume we have a target variable Y
- Y could be real-valued or a binary class variable
- And we have p input variables
- We want to find boxes j in input space where
EY j EY - or where EY j
- A box j is a conjunctive sentence, e.g.,
- if Age
- Example of a box pattern
- if Age 30 and education bachelor then
Eincome j 120k
13Bump Hunting Extrema Regions for Target f(x)
- let Sj be set of all possible values for input
variable xj - entire input domain is S S1 ? S2 ? ? Sd
- goal find subregion R ? S for which
- mR avg x?R f(x) m
- where m ? f(x) p(x) dx (target mean, over all
inputs) - subregion size as fraction of full space
(support) - ?R ?x?R p(x) dx
- tradeoff between mR and ?R (increase ?R
reduce mR) ... - sample-based estimates used in practice
- ?R (1/n) ?Xi?R 1(Xi?R), yavgR 1/(n?R)
?Xi?R yi - note mR is true quantity of interest, not yavgR
14Greedy Covering
- a generic greedy covering algorithm
- first box B1 induced from entire data set
- second box B2 induced from data not covered by B1
- BK induced from remaining data yi,Xi Xi ?
?j1K-1 Bj - do until either
- estimated target mean f(x) in Bk becomes too
small - yavgK avgyi Xi? Bk Xi ? ?j1K-1 Bj ? (1/n) ?ni1 yi
- support of Bk becomes too small
- ?K (1/n) ?i1n 1(Xi? Bk Xi ? ?j1K-1 Bj)
- then select set of boxes R ?j Bj for some
threshold - for which each yavgj some yavgthreshold or
- yield largest yavgR for which ?R ?i ?i ? some
?threshold
15PRIM algorithm
- PRIM uses patient greedy search on individual
variables - Start with all training data and maximal box
- Repeat until minimal box (e.g., minimal support ?
or n - Shrink box by compressing one face of the box
- For each variable in input space
- Peel off a proportion ? of observations to
optimize Ey new box, - typical ?0.05 or ?0.1
- Now expand the box if Eybox can be increased
(pasting) - Yields a sequence of boxes
- Use cross-validation (on Eybox) to select the
best box - Remove box from training data, then repeat
process
16(No Transcript)
17Comments on PRIM
- Works one variable at a time
- So time-complexity is similar to tree algorithms,
i.e., - Linear in p, and n log n for sorting
- Nominal variables
- Can peel/paste on single values, subsets,
negations, etc - Similar in some sense to CART.but
- More patient in search (removes only small
fraction of data at each step) - Useful for finding pockets in the input space
with high-response - e.g., marketing data small groups of consumers
who spend much more on a given product than the
average consumer - Medical data patients with specific demographics
whose response to a drug is much better than the
average patient
18Marketing Data Example (n9409, p502)
- freq air travel ynum flights/yr, global
mean(y)1.7 - B1 mean(y1)4.2, ?10.08 (8 market seg)
- education 16 yrs income 50K ? missing
- occupation in professional/manager, sales,
homemaker - number of children (
- B2 mean(y2)3.2, ?20.07 (2x global
mean) - education 12 yrs ? missing
- income 30K ? missing 18
- married / dual income in single,
married-one-income - these boxes intuitive nothing really surprising
...
19Pattern Finding Algorithms
- Bump-hunting the PRIM algorithm
- Market basket data association rule algorithms
20Transaction Data and Market Baskets
x
x
x
x
x
x
x
- Supermarket example (Srikant and Agrawal, 1997)
- items 50,000, transactions 1.5 million
- Data sets are typically very sparse
21Market Basket Analysis
- given a (huge) transactions database
- each transaction representing basket for 1
customer visit - each transaction containing set of items
(itemset) - finite set of (boolean) items (e.g. wine, cheese,
diaper, beer, ) - Association rules
- classically used on supermarket transaction
databases - associations Trader Joes customers frequently
buy wine cheese - rule people who buy wine also buy cheese 60
of time - infamous beer diapers example
- in evening hours, beer and diapers often
purchased together - generalize to many other problems, e.g.
- baskets documents, items words
- baskets WWW pages, items links
22Market Basket Analysis Complexity
- usually transaction DB too huge to fit in RAM
- common sizes
- number of transactions 105 to 108 (hundreds
of millions) - number of items 102 to 106
(hundreds to millions) - entire DB needs to be examined
- usually very sparse
- e.g. 0.1 chance of buying random item
- subsampling often a useful trick in DM, but
- here, subsampling could easily miss the (rare)
interesting patterns - thus, runtime dominated by disk read times
- motivates focus on minimizing number of disk scans
23Association Rules Problem Definition
- given set I of items, set T transactions, ?t ?T,
t ? I - Itemset Z a set of items (any subset of I)
- support count ?(Z) num transactions containing
Z - given any itemset Z ? I, ?(Z) t t ?T, Z
? t - association rule
- RX ? Y s,c, X,Y ? I, X?Y?
- support
- s(R) s(X?Y) ?(X?Y)/T p(X?Y)
- confidence
- c(R) s(X?Y) / s(X) ?(X?Y) / ?(X) p(X Y)
- goal find all R such that
- s(R) ? given minsup
- c(R) ? given minconf
24Comments on Association Rules
- association rule RX ? Y s,c
- Strictly speaking these are not rules
- i.e., we could have wine cheese and
cheese wine - correlation is not causation
- The space of all possible rules is enormous
- O( 2p ) where p the number of different items
- Will need some form of combinatorial search
algorithm - How are thresholds minsup and minconf selected?
- Not that easy to know ahead of time how to select
these
25Example
- simple example transaction database (T4)
- Transaction1 A,B,C
- Transaction2 A,C
- Transaction3 A,D
- Transaction4 B,E,F
- with minsup50, minconf50
- R1 A -- C s50, c66.6
- s(R1) s(A,C) , c(R1) s(A,C)/s(A) 2/3
- R2 C -- A s50, c100
- s(R2) s(A,C), c(R2) s(A,C)/s(C) 2/2
s(A) 3/4 75 s(B) 2/4
50 s(C) 2/4 50 s(A,C) 2/4 50
26Finding Association Rules
- two steps
- step 1 find all frequent itemsets (F)
- F Z s(Z) ? minsup
(e.g. Za,b,c,d,e) - step 2 find all rules R X -- Y such that
- X ? Y ? F and X ? Y?
(e.g. R a,b,c -- d,e) - s(R) ? minsup and c(R) ? minconf
- step 1s time-complexity typically step 2s
- step 2 need not scan the data (s(X),s(Y) all
cached in step 1) - search space is exponential in I, filters
choices for step 2 - so, most work focuses on fast frequent itemset
generation - step 1 never filters viable candidates for step 2
27Finding Frequent Itemsets
- frequent itemsets Z s(Z)minsup
- Apriori (monotonicity) Principle s(X) ? s(X?Y)
- any subset of a frequent itemset must be frequent
- finding frequent itemsets
- bottom-up approach
- do level-wise, for k1 I
- k1 find frequent singletons
- k2 find frequent pairs (often most costly)
- step k.1 find size-k itemset candidates from the
freq size-(k-1)s of prev level - step k.2 prune candidates Z for which s(Z)
- each level requires a single scan over all the
transaction data - computes support counts ?(Z) t t ?T, Z ?
t for all size-k Z candidates
s(A) 3/4 75 s(B) 2/4
50 s(C) 2/4 50 s(A,C) 2/4 50
28Apriori Example (minsup2)
bottleneck
itemset 1,2 1,3 1,5 2,3 2,5 3,5
C2
F1
C1
transactions T 1,3,4 2,3,5 1,2,3,5 2,5
itemset sup 1 2 2 3 3 3 4 1 5 3
itemset sup 1 2 2 3 3 3 5 3
gen
count (scan T)
filter
count (scan T)
F3
itemset sup 2,3,5 2
C2
C3 knows can avoid gen 1,2,3 (and 1,3,5)
apriori, without counting, because 1,2 (1,5)
not freq
itemset sup 1,2 1 1,3 2 1,5 1 2,3 2
2,5 3 3,5 2
F2
filter
itemset sup 1,3 2 2,3 2 2,5 3 3,5 2
C3
itemset sup 2,3,5 2
notice how C3 C3
filter
itemset 2,3,5
count (scan T)
gen
29(No Transcript)
30Problems with Association Rules
- Consider 4 highly correlated items A, B, C, D
- Say p(subset isubset j) minconf for all
possible pairs of disjoint subsets - And p(subset i ? subset j) minsup
- How many possible rules?
- E.g., A-B, A,BC, A,CB, B,CA
- All possible combinations 4 x 23
- In general for K such items, K x 2K-1 rules
- For highly correlated items there is a
combinatorial explosion of redundant rules - In practice this makes interpretation of
association rule results difficult
31References on Association Rules
- Chapter 13 in text (Sections 13.1 to 13.5)
- Early papers
- R. Agrawal and R. Srikant, Fast algorithms for
mining association rules, in Proceedings of VLDB
1994, pp.487-499, 1994. - R. Agrawal et al. Fast discovery of association
rules, in Advances in Knowledge Discovery and
Data Mining, AAAI/MIT Press, 1996. - More recent
- Good review in Chapter 6 of Data Mining Concepts
and Techniques, J. Han and M. Kamber, Morgan
Kaufmann, 2001. - J. Han, J. Pei, and Y. Yin, Mining frequent
patterns without candidate generation,
Proceedings of SIGMOD 2000, pages 1-12. - Z. Zheng, R. Kohavi, and L. Mason, Real World
Performance of Association Rule Algorithms,
Proceedings of KDD 2001
32Study on Association Rule Algorithms
- Z. Zheng, R. Kohavi, and L. Mason, Real World
Performance of Association Rule Algorithms,
Proceedings of KDD 2001 - Evaluated a variety of association rule
algorithms - Used both real and simulated transaction data
sets - Typical real data set from Web commerce
- Number of transactions 500k
- Number of items 3k
- Maximum transaction size 200
- Average transaction size 5.0
-
33Study on Association Rule Algorithms
- Conclusions
- Very narrow range of minsup yields interesting
rules - Minsup too small too many rules
- Minsup too large misses potentially
interesting patterns - Superexponential growth of rules on real-world
data - Real-world data is different to simulated
transaction data used in research papers, e.g., - Simulated transaction sizes have a mode away from
1 - Real transaction sizes have a mode at 1 and are
highly skewed - Speed-up improvements demonstrated on artificial
data did not generalize to real transaction data -
34Beyond Binary Market Baskets
- counts (vs yes/no)
- e.g. 3 wines vs wine
- quantitative (non-binary) item variables
- popular discretize real variable into k binary
variables - e.g. age3039,incomeK4248 ? buys_PC
- Item hierachies
- Common in practice, e.g., clothing - shirts -
mens shirts, etc - Can learn rules that generalize across the
hierarchy - mining sequential associations/patterns and rules
- e.g. 1_at_0,2_at_5 ? 4_at_15
35Association Rule Finding
Find association rules
Task
Representation
A and B C
P(A,B,C) minsup, P(CA, B) minconf
Score Function
Breadth-first candidate generation
Search/Optimization
Data Management
Linear scans
Models, Parameters
list of all rules satisfying thresholds
36Bump Hunting (PRIM)
Find high score bumps
Task
Representation
A,B EyA,B Ey
Score Function
EyA,B and p(A,B)
Search/Optimization
Greedy search
Data Management
None
Models, Parameters
Set of boxes
37Summary
- Pattern finding
- An interesting and challenging problem
- How to search for interesting/unusual regions
of a high-dimensional space - Two main problems
- Combinatorial search
- How to define interesting (this is the harder
problem) - Two examples of algorithms
- PRIM for bump-hunting
- Apriori for association rule mining
- Many open problems in this research area (room
for new ideas!)