

1
Data Mining
  • Presented By
  • Kevin Seng

2
Papers
  • Rakesh Agrawal and Ramakrishnan Srikant
  • Fast algorithms for mining association rules.
  • Mining sequential patterns.

3
Outline
  • For each paper
  • Present the problem.
  • Describe the algorithms.
  • Intuition
  • Design
  • Performance.

4
Market Basket Introduction
  • Retailers are able to collect massive amounts of
    sales data (basket data)
  • Bar-code technology
  • E-commerce
  • Sales data generally includes customer id,
    transaction date and items bought.

5
Market Basket Problem
  • It would be useful to find association rules
    between transactions.
  • i.e., 75% of the people who buy spaghetti also buy
    tomato sauce.
  • Given a set of basket data, how can we
    efficiently find the set of association rules?

6
Formal Definition (1)
  • L = {i1, i2, …, im} is the set of items.
  • Database D is a set of transactions.
  • Transaction T is a set of items such that T ⊆ L.
  • A unique identifier, TID, is associated with
    each transaction.

7
Formal Definition (2)
  • T contains X, a set of some items in L, if X ⊆ T.
  • Association rule X ⇒ Y, where X ⊂ L, Y ⊂ L, and
    X ∩ Y = ∅.
  • Confidence - % of transactions which contain X
    that also contain Y.
  • Support - % of transactions in D which contain
    X ∪ Y (both measures are computed in the sketch
    below).
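
To make the two measures concrete, here is a minimal Python sketch (mine, not from the paper; names are illustrative):

    def support(transactions, itemset):
        # Fraction of transactions containing every item of `itemset`.
        s = set(itemset)
        return sum(1 for t in transactions if s <= set(t)) / len(transactions)

    def confidence(transactions, x, y):
        # Fraction of transactions containing X that also contain Y:
        # support(X union Y) / support(X).
        return support(transactions, set(x) | set(y)) / support(transactions, x)

    baskets = [{"spaghetti", "sauce"}, {"spaghetti", "sauce", "bread"},
               {"spaghetti"}, {"spaghetti", "sauce"}, {"bread"}]
    print(confidence(baskets, {"spaghetti"}, {"sauce"}))   # 0.75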

8
Formal Definition (3)
  • Given a set of transactions D, we want to
    generate all association rules that have support
    and confidence greater than the user-specified
    minimum support (minsup) and minimum confidence
    (minconf).

9
Problem Decomposition
  • Two sub-problems
  • Find all itemsets that have transaction support
    above minsup.
  • These itemsets are called large itemsets.
  • From all the large itemsets, generate the set of
    association rules that have confidence above
    minconf.

10
Second Sub-problem
  • Straightforward approach
  • For every large itemset l, find all non-empty
    subsets of l.
  • For every such subset a, output a rule of the
    form a ⇒ (l − a) if the ratio of support(l) to
    support(a) is at least minconf.
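
A short Python sketch of this straightforward approach (illustrative; it relies on the fact that every subset of a large itemset is itself large, so its support is already known):

    from itertools import combinations

    def gen_rules(large_itemsets, support, minconf):
        # large_itemsets: iterable of frozensets.
        # support: {frozenset: support fraction}; every subset of a large
        # itemset is itself large, so its support is available here.
        rules = []
        for l in large_itemsets:
            for r in range(1, len(l)):               # non-empty proper subsets a
                for a in map(frozenset, combinations(sorted(l), r)):
                    conf = support[l] / support[a]
                    if conf >= minconf:
                        rules.append((set(a), set(l - a), conf))
        return rules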

11
Discovering Large Itemsets
  • Done with multiple passes over the data.
  • First pass, find all individual items that are
    large (have minimum support).
  • Subsequent pass, using large itemsets found in
    previous pass
  • Generate candidate itemsets.
  • Count support for each candidate itemset.
  • Eliminate itemsets that do not have min support.

12
Algorithm
  • L1 = {large 1-itemsets}
  • for (k = 2; Lk-1 ≠ ∅; k++) do begin
  •     Ck = apriori-gen(Lk-1) // New candidates
  •     forall transactions t ∈ D do // Counting support
  •         Ct = subset(Ck, t) // Candidates contained in t
  •         forall candidates c ∈ Ct do
  •             c.count++
  •     end
  •     Lk = {c ∈ Ck | c.count ≥ minsup}
  • end
  • Answer = ∪k Lk
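
The pseudocode above, rendered as a small runnable Python sketch of my own: candidate generation here is a bare join without the prune step (detailed on the next slides), and support is counted with naive subset tests instead of the subset()/hash-tree machinery.

    def apriori(transactions, minsup):
        tsets = [frozenset(t) for t in transactions]
        n = len(tsets)

        def is_large(c):
            return sum(1 for t in tsets if c <= t) / n >= minsup

        # Pass 1: large 1-itemsets.
        items = {i for t in tsets for i in t}
        L = [{frozenset([i]) for i in items if is_large(frozenset([i]))}]
        k = 2
        while L[-1]:
            prev = L[-1]
            # Join only (no prune): union pairs of (k-1)-itemsets into k-itemsets.
            Ck = {p | q for p in prev for q in prev if len(p | q) == k}
            L.append({c for c in Ck if is_large(c)})
            k += 1
        return [s for level in L for s in level]    # Answer = union of all Lk

    print(apriori([{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}], minsup=0.5))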

13
Candidate Generation
  • AIS and SETM algorithms
  • Uses the transactions in the database to generate
    new candidates.
  • But this generates a lot of candidates which we
    know beforehand are not large!

14
Apriori Algorithms
  • Generate candidates using only large itemsets
    found in previous pass without considering the
    database.
  • Intuition
  • Any subset of a large itemset must be large.

15
Apriori Candidate Generation
  • Takes in Lk-1 and returns Ck.
  • Two steps
  • Join large itemsets Lk-1 with Lk-1.
  • Prune out all itemsets in the joined result which
    contain a (k-1)-subset not found in Lk-1.

16
Candidate Generation (Join)
  • insert into Ck
  • select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  • from Lk-1 p, Lk-1 q
  • where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2,
    p.itemk-1 < q.itemk-1
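
The same join, plus the prune step, as an illustrative Python sketch over itemsets represented as sorted tuples; run on the L3 of the next slide, it reproduces the final C4:

    def apriori_gen(prev_large):
        prev = set(prev_large)
        # Join: p and q share the first k-2 items and p's last item < q's.
        joined = {p + (q[-1],) for p in prev for q in prev
                  if p[:-1] == q[:-1] and p[-1] < q[-1]}
        # Prune: every (k-1)-subset of a candidate must be in Lk-1.
        def subsets(c):
            return (c[:i] + c[i + 1:] for i in range(len(c)))
        return [c for c in joined if all(s in prev for s in subsets(c))]

    L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
    print(apriori_gen(L3))    # [(1, 2, 3, 4)] -- (1, 3, 4, 5) is pruned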

17
Candidate Gen. (Example)
L3: {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}
Join → C4: {1 2 3 4}, {1 3 4 5}
Prune → C4: {1 2 3 4}
18
Counting Support
  • Need to count the number of transactions which
    support a given itemset.
  • For efficiency, use a hash-tree.
  • Subset Function

19
Subset Function (Hash-tree)
  • Candidate itemsets are stored in hash-tree.
  • Leaf node contains a list of itemsets.
  • Interior node contains a hash table.
  • Each bucket of the hash table points to another
    node.
  • Root is at depth 1.
  • Interior nodes at depth d point to nodes at
    depth d+1.

20
Hash-tree Example (1)
[Diagram: hash-tree storing C3 = {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}; the root at depth 1 hashes on the first item, interior nodes hash on the next item, and leaf nodes hold the itemset lists.]
21
Using the hash-tree
  • If we are at a leaf - find all itemsets contained
    in the transaction.
  • If we are at an interior node - hash on each
    remaining element in the transaction.
  • Root node - hash on all elements in the
    transaction (see the sketch below).
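
A deliberately simplified Python sketch of the lookup, assuming a two-level tree in which the root hashes on the first item and every bucket is already a leaf (the real tree keeps hashing on later items as leaves overflow; the bucket count here is arbitrary):

    NBUCKETS = 4

    def build_tree(candidates):
        # Root hash table: bucket -> leaf list of candidate itemsets.
        root = {}
        for c in candidates:
            root.setdefault(c[0] % NBUCKETS, []).append(c)
        return root

    def subset(root, t):
        # Hash on each item of transaction t at the root, then scan leaves.
        tset = set(t)
        return [c for item in sorted(set(t))
                for c in root.get(item % NBUCKETS, [])
                if c[0] == item and set(c) <= tset]

    C3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
    print(subset(build_tree(C3), (1, 2, 3, 4)))
    # [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4)]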

22
Hash-tree Example (2)
[Diagram: walking the transactions of D (e.g. {1 2 3 4}, {2 3 5}) through the hash-tree above to locate the candidates they contain.]
23
AprioriTid (1)
  • Does not use the transactions in the database for
    counting itemset support.
  • Instead stores transactions as sets of possible
    large itemsets, Ck.
  • Each member of Ck is of the form
  • ⟨TID, {Xk}⟩, where Xk is a possible large itemset

24
AprioriTid (2)
  • Advantage of Ck
  • If a transaction does not contain any candidate
    k-itemset then it will have no entry in Ck.
  • Number of entries in Ck may be less than the
    number of transactions in D.
  • Especially for large k.
  • Speeds up counting!
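
An illustrative Python sketch of one AprioriTid pass, assuming entries map each TID to the set of candidate (k-1)-itemsets (frozensets) found in that transaction, and that a candidate is contained exactly when its two generating (k-1)-subsets are:

    def aprioritid_pass(Ck, prev_entries):
        # Ck: iterable of candidate k-itemsets (frozensets).
        # prev_entries: {tid: set of (k-1)-itemsets present in that transaction}.
        counts = {c: 0 for c in Ck}
        new_entries = {}
        for tid, prev in prev_entries.items():
            present = set()
            for c in Ck:
                items = sorted(c)
                # c is in t iff both generating (k-1)-subsets were in t.
                if (frozenset(items[:-1]) in prev and
                        frozenset(items[:-2] + items[-1:]) in prev):
                    counts[c] += 1
                    present.add(c)
            if present:            # transactions with no candidates drop out
                new_entries[tid] = present
        return counts, new_entries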

25
AprioriTid (3)
  • However
  • For small k each entry in Ck may be larger than
    its corresponding transaction.
  • The usual space vs. time tradeoff.

26
AprioriTid (4) Example
27
Observation
  • When Ck does not fit in main memory we see a
    large jump in execution time.
  • AprioriTid beats Apriori only when Ck can fit in
    main memory.

28
AprioriHybrid
  • It is not necessary to use the same algorithm for
    all the passes.
  • Combine the two algorithms!
  • Start with Apriori.
  • When Ck can fit in main memory, switch to
    AprioriTid.

29
Performance (1)
  • Measured performance by running algorithms on
    generated synthetic data.
  • Used the following parameters

30
Performance (2)
31
Performance (3)
32
Mining Sequential Patterns (1)
  • Sequential patterns are ordered lists of itemsets.
  • Market basket examples
  • Customers typically rent Star Wars, then The
    Empire Strikes Back, then Return of the Jedi.
  • Fitted sheets and pillow cases, then a comforter,
    then drapes and ruffles.

33
Mining Sequential Patterns (2)
  • Looks at sequences of transactions as opposed to
    a single transaction.
  • Groups transactions based on customer ID.
  • Customer sequence.

34
Formal Definition (1)
  • Given a database D of customer transactions.
  • Each transaction consists of customer id,
    transaction-time, items purchased.
  • No customer has more than one transaction with
    the same transaction-time.

35
Formal Definition (2)
  • Itemset i = (i1 i2 … im), where ij is an item.
  • Sequence s = ⟨s1 s2 … sn⟩, where sj is an itemset.
  • Sequence ⟨a1 a2 … an⟩ is contained in ⟨b1 b2 … bm⟩
    if there exist integers i1 < i2 < … < in such that
    a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin (see the sketch
    below).
  • A sequence s is maximal if it is not contained in
    any other sequence.
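
A small Python sketch of the containment test (a greedy left-to-right match suffices); sequences are lists of item sets:

    def contains(big, small):
        # Match each itemset of `small` to the earliest later itemset
        # of `big` that is a superset of it.
        i = 0
        for a in small:
            while i < len(big) and not set(a) <= set(big[i]):
                i += 1
            if i == len(big):
                return False
            i += 1
        return True

    # <(3) (4 5) (8)> is contained in <(7) (3 8) (9) (4 5 6) (8)>
    print(contains([{7}, {3, 8}, {9}, {4, 5, 6}, {8}], [{3}, {4, 5}, {8}]))  # True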

36
Formal Definition (3)
  • A customer supports a sequence s if s is
    contained in the customer sequence for this
    customer.
  • Support of a sequence - % of customers who
    support the sequence.
  • For mining association rules, support was a % of
    transactions.

37
Formal Definition (4)
  • Given a database D of customer transactions find
    the maximal sequences among all sequences that
    have a certain user-specified minimum support.
  • Sequences that have support above minsup are
    large sequences.

38
Algorithm Sort Phase
  • Customer ID Major key
  • Transaction-time Minor key
  • Converts the original transaction database into a
    database of customer sequences.

39
Algorithm Litemset Phase (1)
  • Litemset Phase
  • Find all large itemsets.
  • Why?
  • Because each itemset in a large sequence has to
    be a large itemset.

40
Algorithm Litemset Phase (2)
  • To get all large itemsets we can use the Apriori
    algorithms discussed earlier.
  • Need to modify support counting.
  • For sequential patterns, support is measured by
    fraction of customers.
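
A one-function Python sketch of the modified count (names illustrative): an itemset is counted at most once per customer, however many of that customer's transactions contain it.

    def litemset_support(customer_seqs, itemset):
        # customer_seqs: list of customer sequences, each a list of transactions.
        s = set(itemset)
        supporting = sum(1 for seq in customer_seqs
                         if any(s <= set(t) for t in seq))
        return supporting / len(customer_seqs)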

41
Algorithm Litemset Phase (3)
  • Each large itemset is then mapped to a set of
    contiguous integers.
  • Used to compare two large itemsets in constant
    time.

42
Algorithm Transformation (1)
  • Need to repeatedly determine which of a given set
    of large sequences are contained in a customer
    sequence.
  • Represent transactions as sets of large itemsets.
  • Customer sequence now becomes a list of sets of
    itemsets.
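
A Python sketch of the transformation, assuming the litemset phase produced a map from each large itemset (a frozenset) to its integer id:

    def transform(customer_seq, litemset_ids):
        # Replace each transaction by the set of ids of the large itemsets
        # it contains; transactions containing none are dropped. Customers
        # whose transformed sequence comes out empty drop out entirely.
        out = []
        for t in customer_seq:
            ids = {lid for ls, lid in litemset_ids.items() if ls <= set(t)}
            if ids:
                out.append(ids)
        return out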

43
Algorithm Transformation (2)
44
Algorithm Sequence Phase (1)
  • Use the set of large itemsets to find the desired
    sequences.
  • Similar structure to Apriori algorithms used to
    find large itemsets.
  • Use seed set to generate candidate sequences.
  • Count support for each candidate.
  • Eliminate candidate sequences which are not large.

45
Algorithm Sequence Phase (2)
  • Two types of algorithms
  • Count-all counts all large sequences, including
    non-maximal sequences.
  • AprioriAll
  • Count-some tries to avoid counting non-maximal
    sequences by counting longer sequences first.
  • AprioriSome
  • DynamicSome

46
Algorithm Maximal Phase (1)
  • Find the maximal sequences among the set of large
    sequences.
  • S = the set of all large sequences.

47
Algorithm Maximal Phase (2)
  • Use hash-tree to find all subsequences of sk in
    S.
  • Similar to subset function used in finding large
    itemsets.
  • S is stored in hash-tree.
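
An illustrative Python sketch that keeps the semantics but swaps the hash-tree for a linear scan (quadratic, but clear): process sequences from most to fewest total items, discarding any that is contained in a sequence already kept.

    def maximal(large_seqs):
        def contained(small, big):          # greedy subsequence test
            i = 0
            for a in small:
                while i < len(big) and not set(a) <= set(big[i]):
                    i += 1
                if i == len(big):
                    return False
                i += 1
            return True

        out = []
        # Most total items first, so containers are seen before containees.
        for s in sorted(large_seqs, key=lambda s: sum(map(len, s)), reverse=True):
            if not any(contained(s, m) for m in out):
                out.append(s)
        return out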

48
AprioriAll (1)
49
AprioriAll (2)
  • Hash-tree is used for counting.
  • Candidate generation similar to candidate
    generation in finding large itemsets.
  • Except that order matters, and therefore we don't
    have the condition
  • p.itemk-1 < q.itemk-1
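
The sequence version as a Python sketch over integer-mapped litemsets (sorted tuples, names mine): since order matters, p joins with q in both orders; only the last elements must differ.

    def seq_candidate_gen(prev_large):
        prev = set(map(tuple, prev_large))
        # Join: same first k-2 elements, differing last, both orders kept.
        joined = {p + (q[-1],) for p in prev for q in prev
                  if p[:-1] == q[:-1] and p[-1] != q[-1]}
        # Prune: every (k-1)-subsequence must be large.
        def subseqs(c):
            return (c[:i] + c[i + 1:] for i in range(len(c)))
        return [c for c in joined if all(s in prev for s in subseqs(c))]

    L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
    print(seq_candidate_gen(L3))   # [(1, 2, 3, 4)]; e.g. (1, 2, 4, 3) fails pruning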

50
AprioriAll (3)
Example of candidate generation
51
AprioriAll (4)
  • Example

52
Count-some Algorithms
  • Try to avoid counting non-maximal sequences by
    counting longer sequences first.
  • 2 phases
  • Forward Phase - find all large sequences of
    certain lengths.
  • Backward Phase - find all remaining large
    sequences.

53
AprioriSome (1)
  • Determines which lengths to count using next()
    function.
  • next() takes in as a parameter the length of the
    sequence counted in the last pass.
  • next(k) = k + 1 - same as AprioriAll.
  • Balances tradeoff between
  • Counting non-maximal sequences
  • Counting extensions of small candidate sequences

54
AprioriSome (2)
  • hitk = |Lk| / |Ck|
  • Intuition - as hitk increases, the time wasted by
    counting extensions of small candidates decreases.
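
One possible next() as a step function of hitk; the shape follows the paper's idea of skipping further ahead as the hit ratio grows, but treat the exact thresholds as illustrative:

    def next_len(k, hitk):
        # Low hit ratio: be cautious, count the very next length.
        if hitk < 0.666: return k + 1
        if hitk < 0.75:  return k + 2
        if hitk < 0.80:  return k + 3
        if hitk < 0.85:  return k + 4
        return k + 5     # high hit ratio: skip ahead aggressively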

55
AprioriSome (3)
56
AprioriSome (4)
  • Backward Phase
  • For all lengths which we skipped
  • Delete sequences in candidate set which are
    contained in some large sequence.
  • Count remaining candidates and find all sequences
    with min. support.
  • Also delete large sequences found in forward
    phase which are non-maximal.

57
AprioriSome (5)
58
AprioriSome (6)
  • Example

next(k) = 2k, minsup = 2
Forward Phase
[Example tables: candidate 3-sequences C3 and the large 3-sequences found.]

59
AprioriSome (7)
  • Example

Backward Phase
[Example tables: candidate 3-sequences C3 and the large 3-sequences found.]

60
Performance (1)
  • Used generated datasets again
  • Parameters for data

61
Performance (2)
  • DynamicSome generates too many candidates.
  • AprioriSome does a little better than AprioriAll.
  • It avoids counting many non-maximal sequences.

62
Performance (3)
  • Advantage of AprioriSome is reduced for 2
    reasons
  • AprioriSome generates more candidates.
  • Candidates remain memory resident even if a pass
    is skipped over; when they do not fit in memory,
    AprioriSome cannot always follow its heuristic.

63
Wrap up
  • Just presented two classic papers on data mining.
  • Association Rules
  • Sequential Patterns