1
CS 277 Data Mining, Lecture 17: Pattern Discovery Algorithms
  • David Newman
  • Department of Computer Science
  • University of California, Irvine

2
Notices
  • Presentations (10 minutes)
  • Thurs Dec 6th (5 students), usual time, 5pm
  • Thurs Dec 13th (5 students), note time: 4pm
  • Volunteers for presenting Dec 6th? Otherwise
    random
  • Final project report due Thurs Dec 13th
  • please bring a printed copy to class and email me
    a copy
  • Will give instructions for presentations and
    final report

3
Project Presentations
  • On the Thursdays of the next two weeks, each
    student will make an in-class 10-minute
    presentation on their project (with 1 or 2
    minutes for questions)
  • Email me your PowerPoint or PDF slides, with your
    name (e.g., joesmith.ppt), before 3pm on the day
    you are presenting
  • Suggested content
  • Definition of the task/goal
  • Description of data sets
  • Description of algorithms
  • Experimental results and conclusions
  • Be visual where possible! (use figures, graphs,
    etc)

4
Final Project Reports
  • Must be submitted as an email attachment (PDF,
    Word, etc) by 3pm on Thursday Dec 13
  • Use "ICS 278 final project report" in the subject
    line of your email
  • Report should be self-contained
  • Like a short technical paper
  • A reader should be able to repeat your results
  • Include details in appendices if necessary
  • Approximately 1 page of text per section (see
    next slide)
  • graphs/plots don't count; include as many of
    these as you like
  • Can re-use material from proposal and from
    midterm progress report if you wish

5
Suggested Outline of Final Project Report
  • Introduction
  • Clear description of task/goals of the project
  • Motivation: why is this problem interesting
    and/or important?
  • Discussion of relevant literature
  • Summarize relevant aspects of prior
    published/related work
  • Technical approach
  • Data used in your project
  • Exploratory data analysis relevant to your task
  • Include as many plots/graphs as you think are
    useful/relevant
  • Algorithms used in your project
  • Clear description of all algorithms used
  • Credit appropriate sources if you used other
    implementations
  • Experimental Results
  • Clear description of your experimental
    methodology
  • Detailed description of your results (graphs,
    tables, etc)

6
Homework 3 review
  • K-means
  • SVD (LSI)
  • NMF
  • PLSI

7
K-means
  • Classic, Euclidean version
  • Euclidean objective: Q = Σ_d Σ_k r_dk ||x_d − μ_k||²
  • E-Step (Expectation)
  • choose assignments r_dk so that Q is minimized
  • M-Step (Maximization)
  • μ_k = Σ_d r_dk x_d / Σ_d r_dk (the mean of the
    points assigned to cluster k; one-iteration sketch
    below)
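A minimal MATLAB sketch of one iteration of these two steps, assuming X is a
D x W matrix with one document per row, mu is a K x W matrix of centers, and
no cluster goes empty (variable names are illustrative):

    % E-step: r(d) = argmin_k ||x_d - mu_k||^2 minimizes Q given the centers
    dist2 = sum(X.^2,2) - 2*X*mu' + sum(mu.^2,2)';  % D x K squared distances
    [~, r] = min(dist2, [], 2);
    % M-step: mu_k = mean of the points assigned to cluster k minimizes Q given r
    for k = 1:K
        mu(k,:) = mean(X(r == k, :), 1);
    end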

8
K-means
  • Cosine similarity version
  • objective: Q = Σ_d Σ_k r_dk cos(x_d, μ_k)
  • Correct update for objective Q?
  • E-Step
  • choose r_dk so that Q is maximized
  • M-Step

???
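One plausible answer, sketched in MATLAB as the standard spherical K-means
updates (my reading, not confirmed by the slide): assign each document to the
center with maximum cosine, then set each center to the re-normalized mean.
Assumes X, mu, K as above, with unit-length rows of mu:

    Xn = X ./ sqrt(sum(X.^2, 2));      % normalize documents to unit length
    [~, r] = max(Xn*mu', [], 2);       % E-step: r(d) = argmax_k cos(x_d, mu_k)
    for k = 1:K                        % M-step: re-normalized mean direction
        m = mean(Xn(r == k, :), 1);
        mu(k,:) = m / norm(m);         % assumes cluster k is non-empty
    end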
9
K-means
  • Iterations till convergence
  • 20, 1, 5, 15, 12, 10, 5-15
  • Do a reality check on the topics printed
  • You have 4 methods to find topics
  • You should have some prior expectation that the
    different methods will produce similar topics
  • therefore if you have a method that produces
    strange topics, take a second look

10
SVD
    [W,D] = size(X);
    [U,S,V] = svds(X,K);
    for k = 1:K
      [xsort,isort] = sort(-abs(U(:,k)));
      fprintf('vector %d: ', k);
      for i = 1:12
        fprintf('%s ', word{isort(i)});
      end
      fprintf('\n');
    end

11
SVD Time complexity
  • Dense SVD of a D x W matrix X
  • Time = O(min(D·W², W·D²))
  • Google "matlab svd"
  • svd uses LAPACK DGESVD
  • Time complexity is less if
  • Sparse
  • Computing only the K leading singular values /
    singular vectors

12
Matlab svd, svds
  • svd
  • uses LAPACK DGESVD
  • svds
  • uses eigs to find eigenvalues/eigenvectors of the
    augmented matrix [0 X; X' 0]
  • eigs uses ARPACK (Arnoldi Package)
  • ARPACK requires (a routine for) y = A·x
  • y = A x
  • dense: O(n²)
  • sparse: O(n) (i.e., proportional to the nonzeros)
  • ⇒ Estimate work for svds (sketch below)
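A rough sketch of that estimate in MATLAB: each y = A*x in the Arnoldi
iteration costs about nnz(X) operations for sparse X, and svds needs on the
order of K such matvecs (plus restarts), so total matvec work scales roughly
as K·nnz(X). Sizes below are illustrative only:

    X = sprand(1e5, 1e4, 1e-3);    % sparse W x D matrix, ~1e6 nonzeros
    K = 20;
    fprintf('approx %d ops per matvec, order K times that in total\n', nnz(X));
    [U,S,V] = svds(X, K);          % K leading singular triplets via eigs/ARPACK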

13
Space Complexity
  • Size parameters
  • D documents
  • W words
  • N total words, M nonzero entries (M ≈ N/2)
  • L average words per document (L = N/D)
  • K topics
  • Data: many of you said space = D·W
  • PubMed: D = 10^7, W = 10^5, with 4-byte integers
  • need 4 × 10^12 bytes = 4,000 GB of memory
  • memory, not disk
  • this is exceptionally large
  • DON'T IGNORE SPARSENESS (arithmetic sketched below)
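The arithmetic behind this, as a MATLAB sketch (D and W from the slide;
N = 10^9 is an assumed illustrative total word count, with M ≈ N/2 as above):

    D = 1e7;  W = 1e5;       % PubMed-scale sizes from the slide
    N = 1e9;  M = N/2;       % N assumed for illustration; M ~ N/2 nonzeros
    dense_bytes  = 4*D*W;    % full matrix of 4-byte counts: 4e12 bytes
    sparse_bytes = 12*M;     % (doc, word, count) triple per nonzero: 6e9 bytes
    fprintf('dense %.0f GB vs sparse %.0f GB\n', dense_bytes/1e9, sparse_bytes/1e9);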

14
Space Complexity
  • D documents
  • W words
  • N total words, M nonzero entries (M ≈ N/2)
  • L average words per document
  • K topics

Parameters (typically): topics K·W, mixes D·K, total K(D + W)
Data: N = D·L << D·W
Assumption: W << D, so parameters ≈ K·D
Data + Parameters ≈ D(L + K)
15
Complexity comparison
16
Complexity comparison
no free lunch
17
PLSI
  • Some of you did
    prob_d_w = zeros(D,W);
  • this is not scalable
  • D·W is likely to be bigger than K·D·L
  • Space for p(z|w,d)
  • K·D·L (largest of all!!!)
    for k = 1:K
      prob_z_given_w_d{k} = sparse(W,D);
    end

18
Complexity of operations on n x n matrix A
  • Dense A
  • y = A x (see the timing sketch below)
  • x = inv(A) y
  • A x = λ x
  • det(A)
  • Sparse A
  • y = A x
  • x = inv(A) y
  • A x = λ x
  • det(A)
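For the first operation, a small MATLAB timing sketch makes the dense/sparse
gap for y = A x concrete (sizes are arbitrary; the inverse, eigenvalue, and
determinant costs depend on sparsity structure and are not shown):

    n  = 5000;
    Ad = randn(n);              % dense: y = Ad*x costs O(n^2)
    As = sprandn(n, n, 2/n);    % sparse, ~2 nonzeros per row: y = As*x costs O(n)
    x  = randn(n, 1);
    tic; y_dense  = Ad*x; toc
    tic; y_sparse = As*x; toc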

19
  • Lecture
  • Pattern Discovery Algorithms

20
Pattern-Based Algorithms
  • Global predictive and descriptive modeling
  • global models in the sense that they cover all of
    the data space
  • Patterns
  • More local structure, only describe certain
    aspects of the data
  • Examples
  • A single, small, very dense cluster in input space
  • e.g., a new type of galaxy in astronomy data
  • An unusual set of outliers
  • e.g., indications of an anomalous event in
    time-series climate data
  • Associations or rules
  • If bread is purchased and wine is purchased then
    cheese is purchased with probability p
  • Motif-finding in sequences, e.g.,
  • motifs in DNA sequences: noisy "words" in a random
    background

21
General Ideas for Patterns
  • Many patterns can be described in the general
    form
  • if condition 1 then condition 2 (with some
    certainty)
  • Probabilistic rules: If Age > 40 and
    education = college, then income > $50k with
    probability p
  • Bumps
  • If Age > 40 and education = college, then
    mean income = $73k
  • if antecedent then consequent
  • if j then v
  • where j is generally some "box" in the input space
  • where v is a statement about a variable of
    interest, e.g., p(y | j) or E[y | j]
  • Pattern support
  • Support = p(j), or p(j, v)
  • Fraction of points in input space where the
    condition applies
  • Often interested in patterns with larger support

22
How Interesting is a Pattern?
  • Note: interestingness is inherently subjective
  • Depends on what the data analyst already knows
  • Difficult to quantify prior knowledge
  • How interesting a pattern is can be a function of
  • How surprising it is relative to prior knowledge
  • How useful (actionable) it is
  • This is a somewhat open research problem
  • In general, pattern interestingness is difficult
    to quantify
  • ⇒ Use simple surrogate measures in practice

23
How Interesting is a Pattern?
  • Interestingness of a pattern
  • Measures how interesting the pattern j → v is
  • Typical measures of interest
  • Conditional probability: p(v | j)
  • Change in probability: p(v | j) − p(v)
  • Lift: p(v | j) / p(v) (also the log of this)
  • Change in mean target response, e.g.,
    E[y | j] / E[y]

24
Pattern-Finding Algorithms
  • Typically search a data set for the set of
    patterns that maximize some score function
  • Usually a function of both support and
    interestingness
  • E.g.,
  • Association rules
  • Bump-hunting
  • Issues
  • Huge combinatorial search space
  • How many patterns to return to the user
  • How to avoid problems with redundant patterns
  • Statistical issues
  • Even in random noise, if we search over a very
    large number of patterns, we are likely to find
    something that looks significant
  • This is known as multiple hypothesis testing in
    statistics
  • One approach that can help is to conduct
    randomization tests
  • e.g., for matrix data randomly permute the values
    in each column
  • Run the pattern-discovery algorithm; the resulting
    scores provide a null distribution
  • Ideally, also need a 2nd data set to validate
    patterns

25
Transaction Data and Market Baskets
(figure: sparse baskets x items matrix, with an "x" marking each item bought
in each basket)
  • Supermarket example (Srikant and Agrawal, 1997)
  • #items = 50,000; #transactions = 1.5 million
  • Data sets are typically very sparse

26
Market Basket Analysis
  • given a (huge) database of transactions
  • each transaction representing the basket from one
    customer visit
  • each transaction containing a set of items
    (itemset)
  • finite set of (boolean) items (e.g. wine, cheese,
    diapers, beer, ...)
  • Association rules
  • classically used on supermarket transaction
    databases
  • associations: Trader Joe's customers frequently
    buy wine & cheese
  • rule: people who buy wine also buy cheese 60%
    of the time
  • infamous "beer & diapers" example
  • in evening hours, beer and diapers are often
    purchased together
  • generalizes to many other problems, e.g.
  • baskets = documents, items = words
  • baskets = WWW pages, items = links

27
Market Basket Analysis Complexity
  • usually the transaction DB is too huge to fit in RAM
  • common sizes
  • number of transactions: 10^5 to 10^8 (up to
    hundreds of millions)
  • number of items: 10^2 to 10^6 (hundreds to
    millions)
  • the entire DB needs to be examined
  • usually very sparse
  • e.g. 0.1% chance of buying a random item
  • subsampling is often a useful trick in DM, but
  • here, subsampling could easily miss the (rare)
    interesting patterns
  • thus, runtime is dominated by disk read times
  • motivates focus on minimizing number of disk scans

28
Association Rules
From Ullman's data mining lectures:
http://infolab.stanford.edu/~ullman/mining/2006/index.html
  • Market Baskets
  • Frequent Itemsets
  • A-priori Algorithm

29
The Market-Basket Model
  • A large set of items, e.g., things sold in a
    supermarket.
  • A large set of baskets, each of which is a small
    set of the items, e.g., the things one customer
    buys on one day.

30
Support
  • Simplest question: find sets of items that appear
    frequently in the baskets.
  • Support for itemset I: the number of baskets
    containing all items in I.
  • Given a support threshold s, sets of items that
    appear in ≥ s baskets are called frequent
    itemsets.

31
Example
  • Items = {milk, coke, pepsi, beer, juice};
    abbreviated m, c, p, b, j.
  • Support threshold = 3 baskets.
  • B1 = {m, c, b}    B2 = {m, p, j}
  • B3 = {m, b}       B4 = {c, j}
  • B5 = {m, p, b}    B6 = {m, c, b, j}
  • B7 = {c, b, j}    B8 = {b, c}
  • Frequent itemsets: {m}, {c}, {b}, {j}, {m, b},
    {c, b}, {j, c} (support counting sketched below).
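A MATLAB sketch of the support counting for this example, representing each
basket as a string of item letters (representation chosen for illustration):

    baskets = {'mcb','mpj','mb','cj','mpb','mcbj','cbj','bc'};   % B1..B8
    % support of itemset I = number of baskets containing every item of I
    supp = @(I) sum(cellfun(@(b) all(ismember(I, b)), baskets));
    supp('mb')    % = 4, so {m, b} is frequent at threshold 3
    supp('p')     % = 2, so {p} is not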

32
Applications --- (1)
  • Real market baskets: chain stores keep terabytes
    of information about what customers buy together.
  • Tells how typical customers navigate stores, lets
    them position tempting items.
  • Suggests "tie-in" tricks, e.g., run a sale on
    diapers and raise the price of beer.
  • High support needed, or no $$'s.

33
Applications --- (2)
  • Baskets = documents; items = words in those
    documents.
  • Lets us find words that appear together unusually
    frequently, i.e., linked concepts.
  • Baskets = sentences; items = documents
    containing those sentences.
  • Items that appear together too often could
    represent plagiarism.

34
Applications --- (3)
  • Baskets = Web pages; items = linked pages.
  • Pairs of pages with many common references may be
    about the same topic.
  • Baskets = Web pages p; items = pages that
    link to p.
  • Pages with many of the same links may be mirrors
    or about the same topic.

35
Important Point
  • "Market Baskets" is an abstraction that models
    any many-many relationship between two concepts:
    "items" and "baskets."
  • Items need not be contained in baskets.
  • The only difference is that we count
    co-occurrences of items related to a basket, not
    vice-versa.

36
Scale of Problem
  • Wal-Mart sells 100,000 items and can store
    billions of baskets.
  • The Web has over 100,000,000 words and billions
    of pages.

37
Association Rules
  • If-then rules about the contents of baskets.
  • {i1, i2, ..., ik} → j means: "if a basket contains
    all of i1, ..., ik, then it is likely to contain j."
  • Confidence of this association rule is the
    probability of j given i1, ..., ik.

38
Example
  • B1 = {m, c, b}    B2 = {m, p, j}
  • B3 = {m, b}       B4 = {c, j}
  • B5 = {m, p, b}    B6 = {m, c, b, j}
  • B7 = {c, b, j}    B8 = {b, c}
  • An association rule: {m, b} → c.
  • Confidence = 2/4 = 50%.

39
Interest
  • The interest of an association rule X → Y is
    the absolute value of the amount by which the
    confidence differs from the probability of Y.

40
Example
  • B1 = {m, c, b}    B2 = {m, p, j}
  • B3 = {m, b}       B4 = {c, j}
  • B5 = {m, p, b}    B6 = {m, c, b, j}
  • B7 = {c, b, j}    B8 = {b, c}
  • For association rule {m, b} → c, item c appears
    in 5/8 of the baskets.
  • Interest = |2/4 − 5/8| = 1/8: not very
    interesting.
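Reusing baskets and supp from the slide-31 sketch, the same numbers fall out:

    conf     = supp('mbc') / supp('mb');     % confidence of {m,b} -> c: 2/4 = 0.5
    p_c      = supp('c') / numel(baskets);   % p(c) = 5/8
    interest = abs(conf - p_c);              % |1/2 - 5/8| = 1/8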

41
Relationships Among Measures
  • Rules with high support and confidence may be
    useful even if they are not "interesting."
  • We don't care if buying bread causes people to
    buy milk, or whether simply a lot of people buy
    both bread and milk.
  • But high interest suggests a cause that might be
    worth investigating.

42
Finding Association Rules
  • A typical question: find all association rules
    with support ≥ s and confidence ≥ c.
  • Note: the support of an association rule is the
    support of the set of items it mentions.
  • Hard part: finding the high-support (frequent)
    itemsets.
  • Checking the confidence of association rules
    involving those sets is relatively easy.

43
Association Rules Problem Definition
  • given a set I of items and a set T of
    transactions, ∀t ∈ T, t ⊆ I
  • Itemset Z: a set of items (any subset of I)
  • support count σ(Z): the number of transactions
    containing Z
  • given any itemset Z ⊆ I, σ(Z) = |{t : t ∈ T, Z ⊆ t}|
  • association rule
  • R: X → Y [s, c], with X, Y ⊆ I and X ∩ Y = ∅
  • support
  • s(R) = s(X ∪ Y) = σ(X ∪ Y) / |T| = p(X ∪ Y)
  • confidence
  • c(R) = s(X ∪ Y) / s(X) = σ(X ∪ Y) / σ(X) = p(Y | X)
  • goal: find all R such that
  • s(R) ≥ a given minsup
  • c(R) ≥ a given minconf

44
Comments on Association Rules
  • association rule R: X → Y [s, c]
  • Strictly speaking these are not rules
  • i.e., we could have wine → cheese and
    cheese → wine
  • correlation is not causation
  • The space of all possible rules is enormous
  • O(2^p), where p = the number of different items
  • Will need some form of combinatorial search
    algorithm
  • How are thresholds minsup and minconf selected?
  • Not that easy to know ahead of time how to select
    these

45
Example
  • simple example transaction database (|T| = 4)
  • Transaction1 = {A,B,C}
  • Transaction2 = {A,C}
  • Transaction3 = {A,D}
  • Transaction4 = {B,E,F}
  • with minsup = 50%, minconf = 50%
  • R1: A → C [s = 50%, c = 66.6%]
  • s(R1) = s({A,C}), c(R1) = s({A,C})/s(A) = 2/3
  • R2: C → A [s = 50%, c = 100%]
  • s(R2) = s({A,C}), c(R2) = s({A,C})/s(C) = 2/2

s(A) = 3/4 = 75%, s(B) = 2/4 = 50%, s(C) = 2/4 = 50%, s({A,C}) = 2/4 = 50%
46
Finding Association Rules
  • two steps
  • step 1: find all frequent itemsets (F)
  • F = {Z : s(Z) ≥ minsup} (e.g.
    Z = {a,b,c,d,e})
  • step 2: find all rules R: X → Y such that
  • X ∪ Y ∈ F and X ∩ Y = ∅ (e.g.
    R: {a,b,c} → {d,e})
  • s(R) ≥ minsup and c(R) ≥ minconf
  • step 1's time complexity is typically >> step 2's
  • step 2 need not scan the data (s(X), s(Y) are all
    cached in step 1)
  • the search space is exponential in |I|; step 1
    filters choices for step 2
  • so, most work focuses on fast frequent-itemset
    generation
  • step 1 never filters viable candidates for step 2

47
Finding Frequent Itemsets
  • frequent itemsets: {Z : s(Z) ≥ minsup}
  • Apriori (monotonicity) principle: s(X) ≥ s(X ∪ Y)
  • any subset of a frequent itemset must be frequent
  • finding frequent itemsets
  • bottom-up approach
  • do level-wise, for k = 1 ... |I|
  • k = 1: find frequent singletons
  • k = 2: find frequent pairs (often most costly)
  • step k.1: generate size-k itemset candidates from
    the frequent size-(k−1) itemsets of the previous
    level (sketched below)
  • step k.2: prune candidates Z for which s(Z) < minsup
  • each level requires a single scan over all the
    transaction data
  • computes support counts σ(Z) = |{t : t ∈ T, Z ⊆ t}|
    for all size-k candidates Z

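A minimal MATLAB sketch of one level (steps k.1 and k.2), assuming T is a cell
array of transactions (sorted integer vectors), Fkm1 holds the frequent
size-(k−1) itemsets, and minsup_count is the support-count threshold; the join
is simplified and duplicate candidates are not removed:

    Ck = {};                                  % step k.1: generate size-k candidates
    for a = 1:numel(Fkm1)
        for b = a+1:numel(Fkm1)
            cand = union(Fkm1{a}, Fkm1{b});   % join two frequent (k-1)-itemsets
            if numel(cand) == k
                Ck{end+1} = cand;             % duplicates left in for simplicity
            end
        end
    end
    % step k.2: one scan over T counts support for each candidate, then filter
    sup = cellfun(@(Z) sum(cellfun(@(t) all(ismember(Z, t)), T)), Ck);
    Fk  = Ck(sup >= minsup_count);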
48
Apriori Example (minsup = 2)
transactions T: {1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}

C1 (count, scan T): {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
F1 (filter):        {1}:2, {2}:3, {3}:3, {5}:3
C2 (gen):           {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}   (the bottleneck)
C2 (count, scan T): {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
F2 (filter):        {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
C3 (gen):           {2,3,5}
  apriori can avoid generating {1,2,3} (and {1,3,5}) without counting,
  because {1,2} ({1,5}) is not frequent
C3 (count, scan T): {2,3,5}:2
F3 (filter):        {2,3,5}:2
49
Problems with Association Rules
  • Consider 4 highly correlated items A, B, C, D
  • Say p(subset_i | subset_j) > minconf for all
    possible pairs of disjoint subsets
  • And p(subset_i ∪ subset_j) > minsup
  • How many possible rules?
  • E.g., A → B, {A,B} → C, {A,C} → B, {B,C} → A, ...
  • All possible combinations: 4 × 2^3
  • In general, for K such items: K × 2^(K−1) rules
  • For highly correlated items there is a
    combinatorial explosion of redundant rules
  • In practice this makes interpretation of
    association rule results difficult

50
Computation Model
  • Typically, data is kept in a flat file rather
    than a database system.
  • Stored on disk.
  • Stored basket-by-basket.
  • Expand baskets into pairs, triples, etc. as you
    read baskets.

51
File Organization
(figure: a flat file laid out basket-by-basket, each basket followed by its
items: Basket 1, Basket 2, Basket 3, etc.)
52
Computation Model --- (2)
  • The true cost of mining disk-resident data is
    usually the number of disk I/Os.
  • In practice, association-rule algorithms read the
    data in passes --- all baskets read in turn.
  • Thus, we measure the cost by the number of passes
    an algorithm takes.

53
Main-Memory Bottleneck
  • For many frequent-itemset algorithms, main memory
    is the critical resource.
  • As we read baskets, we need to count something,
    e.g., occurrences of pairs.
  • The number of different things we can count is
    limited by main memory.
  • Swapping counts in/out is a disaster.

54
Finding Frequent Pairs
  • The hardest problem often turns out to be finding
    the frequent pairs.
  • We'll concentrate on how to do that, then discuss
    extensions to finding frequent triples, etc.

55
Naïve Algorithm
  • Read file once, counting in main memory the
    occurrences of each pair.
  • Expand each basket of n items into its
    n(n − 1)/2 pairs.
  • Fails if (#items)² exceeds main memory.
  • Remember: #items can be 100K (Wal-Mart) or 10B
    (Web pages).

56
Details of Main-Memory Counting
  • Two approaches
  • Count all item pairs, using a triangular matrix.
  • Keep a table of triples [i, j, c] = "the count of
    the pair of items {i, j} is c."
  • (1) requires only (say) 4 bytes/pair.
  • (2) requires 12 bytes, but only for those pairs
    with count > 0.

57
(figure: Method (1) uses 4 bytes per pair; Method (2) uses 12 bytes per
occurring pair)
58
Details of Approach 1
  • Number items 1, 2, ..., n.
  • Keep pairs in the order {1,2}, {1,3}, ..., {1,n},
    {2,3}, {2,4}, ..., {2,n}, {3,4}, ..., {3,n}, ...,
    {n−1, n}.
  • Find pair {i, j} at the position
    (i − 1)(n − i/2) + j − i (see the sketch below).
  • Total number of pairs: n(n − 1)/2; total bytes
    about 2n².
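The position formula in a one-line MATLAB sketch, with spot checks (n = 5 is
illustrative):

    pairpos = @(i,j,n) (i-1)*(n - i/2) + j - i;  % position of pair {i,j}, i < j
    n = 5;
    pairpos(1,2,n)    % = 1: the first pair {1,2}
    pairpos(2,3,n)    % = 5: the first pair of "row" 2
    pairpos(4,5,n)    % = 10 = n*(n-1)/2: the last pair {n-1,n}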

59
Details of Approach 2
  • You need a hash table, with i and j as the key,
    to locate (i, j, c) triples efficiently.
  • Typically, the cost of the hash structure can be
    neglected.
  • Total bytes used is about 12p, where p is the
    number of pairs that actually occur.
  • Beats triangular matrix if at most 1/3 of
    possible pairs actually occur.

60
A-Priori Algorithm --- (1)
  • A two-pass approach called a-priori limits the
    need for main memory.
  • Key idea: monotonicity. If a set of items
    appears at least s times, so does every subset.
  • Contrapositive for pairs: if item i does not
    appear in s baskets, then no pair including i
    can appear in s baskets.

61
A-Priori Algorithm --- (2)
  • Pass 1 Read baskets and count in main memory the
    occurrences of each item.
  • Requires only memory proportional to #items.
  • Pass 2 Read baskets again and count in main
    memory only those pairs both of which were found
    in Pass 1 to be frequent.
  • Requires memory proportional to the square of the
    number of frequent items only (two-pass sketch below).
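A sketch of the two passes in MATLAB, assuming T is a cell array of baskets
(vectors of integer item ids without repeats), nitems is the number of items,
and s is the support threshold:

    % Pass 1: count single items
    itemcnt = zeros(nitems, 1);
    for t = 1:numel(T)
        itemcnt(T{t}) = itemcnt(T{t}) + 1;
    end
    freq  = find(itemcnt >= s);                 % the frequent items
    newid = zeros(nitems, 1);
    newid(freq) = 1:numel(freq);                % renumber 1..n (slide 63 trick)
    % Pass 2: count only pairs whose items are both frequent
    paircnt = zeros(numel(freq));               % triangular in principle; full here
    for t = 1:numel(T)
        f = sort(newid(T{t}));  f = f(f > 0);   % frequent items in this basket
        for a = 1:numel(f)
            for b = a+1:numel(f)
                paircnt(f(a), f(b)) = paircnt(f(a), f(b)) + 1;
            end
        end
    end
    [i, j] = find(paircnt >= s);                % frequent pairs (by new ids)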

62
Picture of A-Priori
(figure: main-memory layout for the two passes; Pass 1 stores item counts,
Pass 2 stores the frequent-items table plus counts of candidate pairs)
63
Detail for A-Priori
  • You can use the triangular matrix method with
    n = number of frequent items.
  • Saves space compared with storing triples.
  • Trick: number the frequent items 1, 2, ..., and
    keep a table relating the new numbers to the
    original item numbers.

64
Frequent Triples, Etc.
  • For each k, we construct two sets of k-tuples:
  • Ck = candidate k-tuples: those that might be
    frequent sets (support ≥ s), based on information
    from the pass for k − 1.
  • Lk = the set of truly frequent k-tuples.

65
(figure: C1 → filter → L1 → construct → C2 → filter → L2 → construct → C3,
with C1 counted on the first pass and C2 on the second)
66
A-Priori for All Frequent Itemsets
  • One pass for each k.
  • Needs room in main memory to count each candidate
    k-tuple.
  • For typical market-basket data and reasonable
    support (e.g., 1%), k = 2 requires the most
    memory.

67
Frequent Itemsets --- (2)
  • C1 = all items
  • L1 = those items counted frequent on the first pass
  • C2 = pairs, both items chosen from L1
  • In general, Ck = k-tuples, each k − 1 of which is
    in Lk−1
  • Lk = members of Ck with support ≥ s