1
CS 277 Data Mining, Lecture 17: Pattern Discovery Algorithms
  • David Newman
  • Department of Computer Science
  • University of California, Irvine

2
Notices
  • Presentations (10 minutes)
  • Thurs Dec 6th (5 students), usual time, 5pm
  • Thurs Dec 13th (5 students), note time: 4pm
  • Volunteers for presenting Dec 6th? Otherwise
    random
  • Final project report due Thurs Dec 13th
  • please bring a printed copy to class and email me
    a copy
  • Will give instructions for presentations and
    final report

3
Project Presentations
  • On the Thursdays of the next two weeks, each
    student will make an in-class 10-minute
    presentation on their project (with 1 or 2
    minutes for questions)
  • Email me your PowerPoint or PDF slides, with your
    name (e.g., joesmith.ppt), before 3pm on the day
    you are presenting
  • Suggested content
  • Definition of the task/goal
  • Description of data sets
  • Description of algorithms
  • Experimental results and conclusions
  • Be visual where possible! (use figures, graphs,
    etc)

4
Final Project Reports
  • Must be submitted as an email attachment (PDF,
    Word, etc) by 3pm on Thursday Dec 13
  • Use "ICS 278 final project report" in the subject
    line of your email
  • Report should be self-contained
  • Like a short technical paper
  • A reader should be able to repeat your results
  • Include details in appendices if necessary
  • Approximately 1 page of text per section (see
    next slide)
  • graphs/plots don't count; include as many of
    these as you like
  • Can re-use material from proposal and from
    midterm progress report if you wish

5
Suggested Outline of Final Project Report
  • Introduction
  • Clear description of task/goals of the project
  • Motivation: why is this problem interesting
    and/or important?
  • Discussion of relevant literature
  • Summarize relevant aspects of prior
    published/related work
  • Technical approach
  • Data used in your project
  • Exploratory data analysis relevant to your task
  • Include as many plots/graphs as you think are
    useful/relevant
  • Algorithms used in your project
  • Clear description of all algorithms used
  • Credit appropriate sources if you used other
    implementations
  • Experimental Results
  • Clear description of your experimental
    methodology
  • Detailed description of your results (graphs,
    tables, etc)

6
Homework 3 review
  • K-means
  • SVD (LSI)
  • NMF
  • PLSI

7
K-means
  • Classic, Euclidean version
  • Euclidean objective: Q = Σ_d Σ_k r_dk ||x_d − μ_k||²
  • E-Step (Expectation)
  • choose assignments r_dk so that Q is minimized
  • M-Step (Maximization)
  • μ_k = Σ_d r_dk x_d / Σ_d r_dk (the mean of the
    points assigned to cluster k; one-iteration sketch
    below)
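A minimal MATLAB sketch of one iteration of these two steps, assuming X is a
D x W matrix with one document per row, mu is a K x W matrix of centers, and
no cluster goes empty (variable names are illustrative):

    % E-step: r(d) = argmin_k ||x_d - mu_k||^2 minimizes Q given the centers
    dist2 = sum(X.^2,2) - 2*X*mu' + sum(mu.^2,2)';  % D x K squared distances
    [~, r] = min(dist2, [], 2);
    % M-step: mu_k = mean of the points assigned to cluster k minimizes Q given r
    for k = 1:K
        mu(k,:) = mean(X(r == k, :), 1);
    end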

8
K-means
  • Cosine similarity version
  • objective: Q = Σ_d Σ_k r_dk cos(x_d, μ_k)
  • Correct update for objective Q?
  • E-Step
  • choose r_dk so that Q is maximized
  • M-Step

???
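One plausible answer, sketched in MATLAB as the standard spherical K-means
updates (my reading, not confirmed by the slide): assign each document to the
center with maximum cosine, then set each center to the re-normalized mean.
Assumes X, mu, K as above, with unit-length rows of mu:

    Xn = X ./ sqrt(sum(X.^2, 2));      % normalize documents to unit length
    [~, r] = max(Xn*mu', [], 2);       % E-step: r(d) = argmax_k cos(x_d, mu_k)
    for k = 1:K                        % M-step: re-normalized mean direction
        m = mean(Xn(r == k, :), 1);
        mu(k,:) = m / norm(m);         % assumes cluster k is non-empty
    end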
9
K-means
  • Iterations till convergence
  • 20, 1, 5, 15, 12, 10, 5-15
  • Do a reality check on the topics printed
  • You have 4 methods to find topics
  • You should have some prior expectation that the
    different methods will produce similar topics
  • therefore if you have a method that produces
    strange topics, take a second look

10
SVD
    [W,D] = size(X);
    [U,S,V] = svds(X,K);
    for k = 1:K
      [xsort,isort] = sort(-abs(U(:,k)));
      fprintf('vector %d: ', k);
      for i = 1:12
        fprintf('%s ', word{isort(i)});
      end
      fprintf('\n');
    end

11
SVD Time complexity
  • Dense SVD of a D x W matrix X
  • Time = O(min(D·W², W·D²))
  • Google "matlab svd"
  • svd uses LAPACK DGESVD
  • Time complexity is less if
  • Sparse
  • Computing only the K leading singular values /
    singular vectors

12
Matlab svd, svds
  • svd
  • uses LAPACK DGESVD
  • svds
  • uses eigs to find eigenvalues/eigenvectors of the
    augmented matrix [0 X; X' 0]
  • eigs uses ARPACK (Arnoldi Package)
  • ARPACK requires (a routine for) y = A·x
  • y = A x
  • dense: O(n²)
  • sparse: O(n) (i.e., proportional to the nonzeros)
  • ⇒ Estimate work for svds (sketch below)
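A rough sketch of that estimate in MATLAB: each y = A*x in the Arnoldi
iteration costs about nnz(X) operations for sparse X, and svds needs on the
order of K such matvecs (plus restarts), so total matvec work scales roughly
as K·nnz(X). Sizes below are illustrative only:

    X = sprand(1e5, 1e4, 1e-3);    % sparse W x D matrix, ~1e6 nonzeros
    K = 20;
    fprintf('approx %d ops per matvec, order K times that in total\n', nnz(X));
    [U,S,V] = svds(X, K);          % K leading singular triplets via eigs/ARPACK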

13
Space Complexity
  • Size parameters
  • D documents
  • W words
  • N total words, M nonzero entries (M ≈ N/2)
  • L average words per document (L = N/D)
  • K topics
  • Data: many of you said space = D·W
  • PubMed: D = 10^7, W = 10^5, with 4-byte integers
  • need 4 × 10^12 bytes = 4,000 GB of memory
  • memory, not disk
  • this is exceptionally large
  • DON'T IGNORE SPARSENESS (arithmetic sketched below)
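The arithmetic behind this, as a MATLAB sketch (D and W from the slide;
N = 10^9 is an assumed illustrative total word count, with M ≈ N/2 as above):

    D = 1e7;  W = 1e5;       % PubMed-scale sizes from the slide
    N = 1e9;  M = N/2;       % N assumed for illustration; M ~ N/2 nonzeros
    dense_bytes  = 4*D*W;    % full matrix of 4-byte counts: 4e12 bytes
    sparse_bytes = 12*M;     % (doc, word, count) triple per nonzero: 6e9 bytes
    fprintf('dense %.0f GB vs sparse %.0f GB\n', dense_bytes/1e9, sparse_bytes/1e9);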

14
Space Complexity
  • D documents
  • W words
  • N total words, M nonzero entries (M ≈ N/2)
  • L average words per document
  • K topics

Parameters (typically): topics K·W, mixes D·K, total K(D + W)
Data: N = D·L << D·W
Assumption: W << D, so parameters ≈ K·D
Data + Parameters ≈ D(L + K)
15
Complexity comparison
16
Complexity comparison
no free lunch
17
PLSI
  • Some of you did
    prob_d_w = zeros(D,W);
  • this is not scalable
  • D·W is likely to be bigger than K·D·L
  • Space for p(z|w,d)
  • K·D·L (largest of all!!!)
    for k = 1:K
      prob_z_given_w_d{k} = sparse(W,D);
    end

18
Complexity of operations on n x n matrix A
  • Dense A
  • y = A x (see the timing sketch below)
  • x = inv(A) y
  • A x = λ x
  • det(A)
  • Sparse A
  • y = A x
  • x = inv(A) y
  • A x = λ x
  • det(A)
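For the first operation, a small MATLAB timing sketch makes the dense/sparse
gap for y = A x concrete (sizes are arbitrary; the inverse, eigenvalue, and
determinant costs depend on sparsity structure and are not shown):

    n  = 5000;
    Ad = randn(n);              % dense: y = Ad*x costs O(n^2)
    As = sprandn(n, n, 2/n);    % sparse, ~2 nonzeros per row: y = As*x costs O(n)
    x  = randn(n, 1);
    tic; y_dense  = Ad*x; toc
    tic; y_sparse = As*x; toc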

19
  • Lecture
  • Pattern Discovery Algorithms

20
Pattern-Based Algorithms
  • Global predictive and descriptive modeling
  • global models in the sense that they cover all of
    the data space
  • Patterns
  • More local structure, only describe certain
    aspects of the data
  • Examples
  • A single, small, very dense cluster in input space
  • e.g., a new type of galaxy in astronomy data
  • An unusual set of outliers
  • e.g., indications of an anomalous event in
    time-series climate data
  • Associations or rules
  • If bread is purchased and wine is purchased then
    cheese is purchased with probability p
  • Motif-finding in sequences, e.g.,
  • motifs in DNA sequences: noisy "words" in a random
    background

21
General Ideas for Patterns
  • Many patterns can be described in the general
    form
  • if condition 1 then condition 2 (with some
    certainty)
  • Probabilistic rules: If Age > 40 and
    education = college, then income > $50k with
    probability p
  • Bumps
  • If Age > 40 and education = college, then
    mean income = $73k
  • if antecedent then consequent
  • if j then v
  • where j is generally some "box" in the input space
  • where v is a statement about a variable of
    interest, e.g., p(y | j) or E[y | j]
  • Pattern support
  • Support = p(j), or p(j, v)
  • Fraction of points in input space where the
    condition applies
  • Often interested in patterns with larger support

22
How Interesting is a Pattern?
  • Note: interestingness is inherently subjective
  • Depends on what the data analyst already knows
  • Difficult to quantify prior knowledge
  • How interesting a pattern is can be a function of
  • How surprising it is relative to prior knowledge
  • How useful (actionable) it is
  • This is a somewhat open research problem
  • In general, pattern interestingness is difficult
    to quantify
  • ⇒ Use simple surrogate measures in practice

23
How Interesting is a Pattern?
  • Interestingness of a pattern
  • Measures how interesting the pattern j → v is
  • Typical measures of interest
  • Conditional probability: p(v | j)
  • Change in probability: p(v | j) − p(v)
  • Lift: p(v | j) / p(v) (also the log of this)
  • Change in mean target response, e.g.,
    E[y | j] / E[y]

24
Pattern-Finding Algorithms
  • Typically search a data set for the set of
    patterns that maximize some score function
  • Usually a function of both support and
    interestingness
  • E.g.,
  • Association rules
  • Bump-hunting
  • Issues
  • Huge combinatorial search space
  • How many patterns to return to the user
  • How to avoid problems with redundant patterns
  • Statistical issues
  • Even in random noise, if we search over a very
    large number of patterns, we are likely to find
    something that looks significant
  • This is known as multiple hypothesis testing in
    statistics
  • One approach that can help is to conduct
    randomization tests
  • e.g., for matrix data randomly permute the values
    in each column
  • Run the pattern-discovery algorithm; the resulting
    scores provide a null distribution
  • Ideally, also need a 2nd data set to validate
    patterns

25
Transaction Data and Market Baskets
(figure: sparse baskets x items matrix, with an "x" marking each item bought
in each basket)
  • Supermarket example (Srikant and Agrawal, 1997)
  • #items = 50,000; #transactions = 1.5 million
  • Data sets are typically very sparse

26
Market Basket Analysis
  • given a (huge) database of transactions
  • each transaction representing the basket from one
    customer visit
  • each transaction containing a set of items
    (itemset)
  • finite set of (boolean) items (e.g. wine, cheese,
    diapers, beer, ...)
  • Association rules
  • classically used on supermarket transaction
    databases
  • associations: Trader Joe's customers frequently
    buy wine & cheese
  • rule: people who buy wine also buy cheese 60%
    of the time
  • infamous "beer & diapers" example
  • in evening hours, beer and diapers are often
    purchased together
  • generalizes to many other problems, e.g.
  • baskets = documents, items = words
  • baskets = WWW pages, items = links

27
Market Basket Analysis Complexity
  • usually the transaction DB is too huge to fit in RAM
  • common sizes
  • number of transactions: 10^5 to 10^8 (up to
    hundreds of millions)
  • number of items: 10^2 to 10^6 (hundreds to
    millions)
  • the entire DB needs to be examined
  • usually very sparse
  • e.g. 0.1% chance of buying a random item
  • subsampling is often a useful trick in DM, but
  • here, subsampling could easily miss the (rare)
    interesting patterns
  • thus, runtime is dominated by disk read times
  • motivates focus on minimizing number of disk scans

28
Association Rules
From Ullman's data mining lectures:
http://infolab.stanford.edu/~ullman/mining/2006/index.html
  • Market Baskets
  • Frequent Itemsets
  • A-priori Algorithm

29
The Market-Basket Model
  • A large set of items, e.g., things sold in a
    supermarket.
  • A large set of baskets, each of which is a small
    set of the items, e.g., the things one customer
    buys on one day.

30
Support
  • Simplest question: find sets of items that appear
    frequently in the baskets.
  • Support for itemset I: the number of baskets
    containing all items in I.
  • Given a support threshold s, sets of items that
    appear in ≥ s baskets are called frequent
    itemsets.

31
Example
  • Items = {milk, coke, pepsi, beer, juice};
    abbreviated m, c, p, b, j.
  • Support threshold = 3 baskets.
  • B1 = {m, c, b}    B2 = {m, p, j}
  • B3 = {m, b}       B4 = {c, j}
  • B5 = {m, p, b}    B6 = {m, c, b, j}
  • B7 = {c, b, j}    B8 = {b, c}
  • Frequent itemsets: {m}, {c}, {b}, {j}, {m, b},
    {c, b}, {j, c} (support counting sketched below).
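A MATLAB sketch of the support counting for this example, representing each
basket as a string of item letters (representation chosen for illustration):

    baskets = {'mcb','mpj','mb','cj','mpb','mcbj','cbj','bc'};   % B1..B8
    % support of itemset I = number of baskets containing every item of I
    supp = @(I) sum(cellfun(@(b) all(ismember(I, b)), baskets));
    supp('mb')    % = 4, so {m, b} is frequent at threshold 3
    supp('p')     % = 2, so {p} is not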

32
Applications --- (1)
  • Real market baskets: chain stores keep terabytes
    of information about what customers buy together.
  • Tells how typical customers navigate stores, lets
    them position tempting items.
  • Suggests "tie-in" tricks, e.g., run a sale on
    diapers and raise the price of beer.
  • High support needed, or no $$'s.

33
Applications --- (2)
  • Baskets = documents; items = words in those
    documents.
  • Lets us find words that appear together unusually
    frequently, i.e., linked concepts.
  • Baskets = sentences; items = documents
    containing those sentences.
  • Items that appear together too often could
    represent plagiarism.

34
Applications --- (3)
  • Baskets = Web pages; items = linked pages.
  • Pairs of pages with many common references may be
    about the same topic.
  • Baskets = Web pages p; items = pages that
    link to p.
  • Pages with many of the same links may be mirrors
    or about the same topic.

35
Important Point
  • "Market Baskets" is an abstraction that models
    any many-many relationship between two concepts:
    "items" and "baskets."
  • Items need not be contained in baskets.
  • The only difference is that we count
    co-occurrences of items related to a basket, not
    vice-versa.

36
Scale of Problem
  • Wal-Mart sells 100,000 items and can store
    billions of baskets.
  • The Web has over 100,000,000 words and billions
    of pages.

37
Association Rules
  • If-then rules about the contents of baskets.
  • {i1, i2, ..., ik} → j means: "if a basket contains
    all of i1, ..., ik, then it is likely to contain j."
  • Confidence of this association rule is the
    probability of j given i1, ..., ik.

38
Example
  • B1 = {m, c, b}    B2 = {m, p, j}
  • B3 = {m, b}       B4 = {c, j}
  • B5 = {m, p, b}    B6 = {m, c, b, j}
  • B7 = {c, b, j}    B8 = {b, c}
  • An association rule: {m, b} → c.
  • Confidence = 2/4 = 50%.

39
Interest
  • The interest of an association rule X → Y is
    the absolute value of the amount by which the
    confidence differs from the probability of Y.

40
Example
  • B1 = {m, c, b}    B2 = {m, p, j}
  • B3 = {m, b}       B4 = {c, j}
  • B5 = {m, p, b}    B6 = {m, c, b, j}
  • B7 = {c, b, j}    B8 = {b, c}
  • For association rule {m, b} → c, item c appears
    in 5/8 of the baskets.
  • Interest = |2/4 − 5/8| = 1/8: not very
    interesting.
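Reusing baskets and supp from the slide-31 sketch, the same numbers fall out:

    conf     = supp('mbc') / supp('mb');     % confidence of {m,b} -> c: 2/4 = 0.5
    p_c      = supp('c') / numel(baskets);   % p(c) = 5/8
    interest = abs(conf - p_c);              % |1/2 - 5/8| = 1/8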

41
Relationships Among Measures
  • Rules with high support and confidence may be
    useful even if they are not "interesting."
  • We don't care if buying bread causes people to
    buy milk, or whether simply a lot of people buy
    both bread and milk.
  • But high interest suggests a cause that might be
    worth investigating.

42
Finding Association Rules
  • A typical question: find all association rules
    with support ≥ s and confidence ≥ c.
  • Note: the support of an association rule is the
    support of the set of items it mentions.
  • Hard part: finding the high-support (frequent)
    itemsets.
  • Checking the confidence of association rules
    involving those sets is relatively easy.

43
Association Rules Problem Definition
  • given a set I of items and a set T of
    transactions, ∀t ∈ T, t ⊆ I
  • Itemset Z: a set of items (any subset of I)
  • support count σ(Z): the number of transactions
    containing Z
  • given any itemset Z ⊆ I, σ(Z) = |{t : t ∈ T, Z ⊆ t}|
  • association rule
  • R: X → Y [s, c], with X, Y ⊆ I and X ∩ Y = ∅
  • support
  • s(R) = s(X ∪ Y) = σ(X ∪ Y) / |T| = p(X ∪ Y)
  • confidence
  • c(R) = s(X ∪ Y) / s(X) = σ(X ∪ Y) / σ(X) = p(Y | X)
  • goal: find all R such that
  • s(R) ≥ a given minsup
  • c(R) ≥ a given minconf

44
Comments on Association Rules
  • association rule R: X → Y [s, c]
  • Strictly speaking these are not rules
  • i.e., we could have wine → cheese and
    cheese → wine
  • correlation is not causation
  • The space of all possible rules is enormous
  • O(2^p), where p = the number of different items
  • Will need some form of combinatorial search
    algorithm
  • How are thresholds minsup and minconf selected?
  • Not that easy to know ahead of time how to select
    these

45
Example
  • simple example transaction database (|T| = 4)
  • Transaction1 = {A,B,C}
  • Transaction2 = {A,C}
  • Transaction3 = {A,D}
  • Transaction4 = {B,E,F}
  • with minsup = 50%, minconf = 50%
  • R1: A → C [s = 50%, c = 66.6%]
  • s(R1) = s({A,C}), c(R1) = s({A,C})/s(A) = 2/3
  • R2: C → A [s = 50%, c = 100%]
  • s(R2) = s({A,C}), c(R2) = s({A,C})/s(C) = 2/2

s(A) = 3/4 = 75%, s(B) = 2/4 = 50%, s(C) = 2/4 = 50%, s({A,C}) = 2/4 = 50%
46
Finding Association Rules
  • two steps
  • step 1: find all frequent itemsets (F)
  • F = {Z : s(Z) ≥ minsup} (e.g.
    Z = {a,b,c,d,e})
  • step 2: find all rules R: X → Y such that
  • X ∪ Y ∈ F and X ∩ Y = ∅ (e.g.
    R: {a,b,c} → {d,e})
  • s(R) ≥ minsup and c(R) ≥ minconf
  • step 1's time complexity is typically >> step 2's
  • step 2 need not scan the data (s(X), s(Y) are all
    cached in step 1)
  • the search space is exponential in |I|; step 1
    filters choices for step 2
  • so, most work focuses on fast frequent-itemset
    generation
  • step 1 never filters viable candidates for step 2

47
Finding Frequent Itemsets
  • frequent itemsets: {Z : s(Z) ≥ minsup}
  • Apriori (monotonicity) principle: s(X) ≥ s(X ∪ Y)
  • any subset of a frequent itemset must be frequent
  • finding frequent itemsets
  • bottom-up approach
  • do level-wise, for k = 1 ... |I|
  • k = 1: find frequent singletons
  • k = 2: find frequent pairs (often most costly)
  • step k.1: generate size-k itemset candidates from
    the frequent size-(k−1) itemsets of the previous
    level (sketched below)
  • step k.2: prune candidates Z for which s(Z) < minsup
  • each level requires a single scan over all the
    transaction data
  • computes support counts σ(Z) = |{t : t ∈ T, Z ⊆ t}|
    for all size-k candidates Z

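A minimal MATLAB sketch of one level (steps k.1 and k.2), assuming T is a cell
array of transactions (sorted integer vectors), Fkm1 holds the frequent
size-(k−1) itemsets, and minsup_count is the support-count threshold; the join
is simplified and duplicate candidates are not removed:

    Ck = {};                                  % step k.1: generate size-k candidates
    for a = 1:numel(Fkm1)
        for b = a+1:numel(Fkm1)
            cand = union(Fkm1{a}, Fkm1{b});   % join two frequent (k-1)-itemsets
            if numel(cand) == k
                Ck{end+1} = cand;             % duplicates left in for simplicity
            end
        end
    end
    % step k.2: one scan over T counts support for each candidate, then filter
    sup = cellfun(@(Z) sum(cellfun(@(t) all(ismember(Z, t)), T)), Ck);
    Fk  = Ck(sup >= minsup_count);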
48
Apriori Example (minsup = 2)
transactions T: {1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}

C1 (count, scan T): {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
F1 (filter):        {1}:2, {2}:3, {3}:3, {5}:3
C2 (gen):           {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}   (the bottleneck)
C2 (count, scan T): {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
F2 (filter):        {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
C3 (gen):           {2,3,5}
  apriori can avoid generating {1,2,3} (and {1,3,5}) without counting,
  because {1,2} ({1,5}) is not frequent
C3 (count, scan T): {2,3,5}:2
F3 (filter):        {2,3,5}:2
49
Problems with Association Rules
  • Consider 4 highly correlated items A, B, C, D
  • Say p(subset_i | subset_j) > minconf for all
    possible pairs of disjoint subsets
  • And p(subset_i ∪ subset_j) > minsup
  • How many possible rules?
  • E.g., A → B, {A,B} → C, {A,C} → B, {B,C} → A, ...
  • All possible combinations: 4 × 2^3
  • In general, for K such items: K × 2^(K−1) rules
  • For highly correlated items there is a
    combinatorial explosion of redundant rules
  • In practice this makes interpretation of
    association rule results difficult

50
Computation Model
  • Typically, data is kept in a flat file rather
    than a database system.
  • Stored on disk.
  • Stored basket-by-basket.
  • Expand baskets into pairs, triples, etc. as you
    read baskets.

51
File Organization
(figure: a flat file laid out basket-by-basket, each basket followed by its
items: Basket 1, Basket 2, Basket 3, etc.)
52
Computation Model --- (2)
  • The true cost of mining disk-resident data is
    usually the number of disk I/Os.
  • In practice, association-rule algorithms read the
    data in passes --- all baskets read in turn.
  • Thus, we measure the cost by the number of passes
    an algorithm takes.

53
Main-Memory Bottleneck
  • For many frequent-itemset algorithms, main memory
    is the critical resource.
  • As we read baskets, we need to count something,
    e.g., occurrences of pairs.
  • The number of different things we can count is
    limited by main memory.
  • Swapping counts in/out is a disaster.

54
Finding Frequent Pairs
  • The hardest problem often turns out to be finding
    the frequent pairs.
  • We'll concentrate on how to do that, then discuss
    extensions to finding frequent triples, etc.

55
Naïve Algorithm
  • Read file once, counting in main memory the
    occurrences of each pair.
  • Expand each basket of n items into its
    n(n − 1)/2 pairs.
  • Fails if (#items)² exceeds main memory.
  • Remember: #items can be 100K (Wal-Mart) or 10B
    (Web pages).

56
Details of Main-Memory Counting
  • Two approaches
  • Count all item pairs, using a triangular matrix.
  • Keep a table of triples [i, j, c] = "the count of
    the pair of items {i, j} is c."
  • (1) requires only (say) 4 bytes/pair.
  • (2) requires 12 bytes, but only for those pairs
    with count > 0.

57
(figure: Method (1) uses 4 bytes per pair; Method (2) uses 12 bytes per
occurring pair)
58
Details of Approach 1
  • Number items 1, 2, ..., n.
  • Keep pairs in the order {1,2}, {1,3}, ..., {1,n},
    {2,3}, {2,4}, ..., {2,n}, {3,4}, ..., {3,n}, ...,
    {n−1, n}.
  • Find pair {i, j} at the position
    (i − 1)(n − i/2) + j − i (see the sketch below).
  • Total number of pairs: n(n − 1)/2; total bytes
    about 2n².
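The position formula in a one-line MATLAB sketch, with spot checks (n = 5 is
illustrative):

    pairpos = @(i,j,n) (i-1)*(n - i/2) + j - i;  % position of pair {i,j}, i < j
    n = 5;
    pairpos(1,2,n)    % = 1: the first pair {1,2}
    pairpos(2,3,n)    % = 5: the first pair of "row" 2
    pairpos(4,5,n)    % = 10 = n*(n-1)/2: the last pair {n-1,n}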

59
Details of Approach 2
  • You need a hash table, with i and j as the key,
    to locate (i, j, c) triples efficiently.
  • Typically, the cost of the hash structure can be
    neglected.
  • Total bytes used is about 12p, where p is the
    number of pairs that actually occur.
  • Beats triangular matrix if at most 1/3 of
    possible pairs actually occur.

60
A-Priori Algorithm --- (1)
  • A two-pass approach called a-priori limits the
    need for main memory.
  • Key idea: monotonicity. If a set of items
    appears at least s times, so does every subset.
  • Contrapositive for pairs: if item i does not
    appear in s baskets, then no pair including i
    can appear in s baskets.

61
A-Priori Algorithm --- (2)
  • Pass 1 Read baskets and count in main memory the
    occurrences of each item.
  • Requires only memory proportional to #items.
  • Pass 2 Read baskets again and count in main
    memory only those pairs both of which were found
    in Pass 1 to be frequent.
  • Requires memory proportional to the square of the
    number of frequent items only (two-pass sketch below).
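A sketch of the two passes in MATLAB, assuming T is a cell array of baskets
(vectors of integer item ids without repeats), nitems is the number of items,
and s is the support threshold:

    % Pass 1: count single items
    itemcnt = zeros(nitems, 1);
    for t = 1:numel(T)
        itemcnt(T{t}) = itemcnt(T{t}) + 1;
    end
    freq  = find(itemcnt >= s);                 % the frequent items
    newid = zeros(nitems, 1);
    newid(freq) = 1:numel(freq);                % renumber 1..n (slide 63 trick)
    % Pass 2: count only pairs whose items are both frequent
    paircnt = zeros(numel(freq));               % triangular in principle; full here
    for t = 1:numel(T)
        f = sort(newid(T{t}));  f = f(f > 0);   % frequent items in this basket
        for a = 1:numel(f)
            for b = a+1:numel(f)
                paircnt(f(a), f(b)) = paircnt(f(a), f(b)) + 1;
            end
        end
    end
    [i, j] = find(paircnt >= s);                % frequent pairs (by new ids)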

62
Picture of A-Priori
(figure: main-memory layout for the two passes; Pass 1 stores item counts,
Pass 2 stores the frequent-items table plus counts of candidate pairs)
63
Detail for A-Priori
  • You can use the triangular matrix method with
    n = number of frequent items.
  • Saves space compared with storing triples.
  • Trick: number the frequent items 1, 2, ..., and
    keep a table relating the new numbers to the
    original item numbers.

64
Frequent Triples, Etc.
  • For each k, we construct two sets of k-tuples:
  • Ck = candidate k-tuples: those that might be
    frequent sets (support ≥ s), based on information
    from the pass for k − 1.
  • Lk = the set of truly frequent k-tuples.

65
(figure: C1 → filter → L1 → construct → C2 → filter → L2 → construct → C3,
with C1 counted on the first pass and C2 on the second)
66
A-Priori for All Frequent Itemsets
  • One pass for each k.
  • Needs room in main memory to count each candidate
    k-tuple.
  • For typical market-basket data and reasonable
    support (e.g., 1%), k = 2 requires the most
    memory.

67
Frequent Itemsets --- (2)
  • C1 = all items
  • L1 = those items counted frequent on the first pass
  • C2 = pairs, both items chosen from L1
  • In general, Ck = k-tuples, each k − 1 of which is
    in Lk−1
  • Lk = members of Ck with support ≥ s