1
Incremental and On-Line Data Mining
  • Prof. Tzung-Pei Hong
  • Department of Electrical Engineering National
    University of Kaohsiung

2
Outline
  • Introduction
  • Apriori Mining Algorithm
  • Incremental Mining
  • FUP algorithm
  • Negative border algorithm
  • Pre-large itemset algorithm

3
Outline
  • Multi-dimensional Online Mining
  • Knowledge Warehouse
  • Three-phase Online Association Rule Mining
    (TOARM)
  • Negative-Border Online Mining (NOM)
  • Lattice-based NOM (LNOM)
  • Conclusion

4
Why Data Mining
How to arrange goods into supermarket?
5
Association Rules
IF bread is bought then milk is bought
6
The Role of Data Mining
Useful patterns
Transaction data
Data Mining
Knowledge and strategy
Preprocess data
7
Different Kinds of Knowledge
  • Association rules
  • Generalized association rules
  • Sequential patterns
  • Quantitative association rules
  • Classification rules
  • Clustering rules
  • etc.

8
Association Rules
IF bread is bought then milk is bought
9
Apriori Algorithm
  • Proposed by Agrawal et al.
  • Step 1: Define minsup and minconf
  • e.g. minsup = 50%
  • minconf = 50%
  • Step 2: Find large itemsets
  • Step 3: Generate association rules
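The three steps above can be sketched in Python. This is a minimal, unoptimized illustration; the four-transaction database and 50% minsup are an assumed example chosen to be consistent with the L1 counts shown on the Example slide.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal Apriori sketch: return all large (frequent) itemsets."""
    min_count = minsup * len(transactions)

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    levels = [{frozenset([i]) for i in items
               if count(frozenset([i])) >= min_count}]          # L1
    k = 2
    while levels[-1]:
        prev = levels[-1]
        # Join step: size-k candidates from pairs of large (k-1)-itemsets
        cands = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be large
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        levels.append({c for c in cands if count(c) >= min_count})
        k += 1
    return {s for level in levels for s in level}

db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
print(sorted(''.join(sorted(s)) for s in apriori(db, 0.5)))
# → ['A', 'AC', 'B', 'BC', 'BCE', 'BE', 'C', 'CE', 'E']
```

With minsup = 50% the minimum count is 2, so D (count 1) drops out in the first pass.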

10
Example
Scan the database to obtain the large 1-itemsets L1:

  Itemset  Sup.
  A        2
  B        3
  C        3
  E        3
11
Example

12
Maintenance of Association Rules
  • Insertion of new transactions

New transactions
Original database
13
Two Problems
  • Problem 1
  • Are large itemsets still large?
  • e.g. minsup = 50%

A's count in the original database (4 records): 2
A's count in the new transactions (2 records): 0
2 + 0 = 2 < (4 + 2) × 50% = 3 (minimum count)
14
Two Problems
  • Problem 2
  • Are small itemsets still small?
  • e.g. minsup = 50%

D's count in the original database (4 records): 1
D's count in the new transactions (2 records): 2
1 + 2 = 3 ≥ (4 + 2) × 50% = 3 (minimum count)
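Both problems reduce to the same check of a combined count against the updated minimum count. A small sketch (the function and parameter names are illustrative):

```python
def still_large(orig_count, new_count, orig_size, new_size, minsup):
    """Is the itemset large in the updated database?  The minimum count
    is recomputed over the combined (original + new) transactions."""
    return orig_count + new_count >= (orig_size + new_size) * minsup

# Problem 1: A was large (2 of 4 records) but absent from the 2 new ones
print(still_large(2, 0, 4, 2, 0.5))   # → False (2 < 3)
# Problem 2: D was small (1 of 4) but occurs twice in the new records
print(still_large(1, 2, 4, 2, 0.5))   # → True (3 >= 3)
```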
15
Maintenance
  • Intuition: re-mine the whole database on each data insertion

Updated database
Re-mining
1. Wastes the previously discovered information
2. Spends much computational time
16
Related Research
  • FUP algorithm
  • Negative border algorithms
  • Pre-large itemset algorithm

17
FUP Algorithm
  • Cheung et al., 1995
  • For efficient incremental data mining
  • Step 1: Record all large itemsets

18
Four Cases in FUP
19
FUP Algorithm
  • Step 2: Process new transactions

Originally large itemsets
Re-calculating the updated item counts
Large itemsets in new transactions?
Originally small itemsets
Small itemsets in new transactions
Reduce rescanning database
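FUP's dispatch over the four cases can be sketched as follows. This is a hypothetical per-itemset helper; the real algorithm works level-wise over candidate itemsets.

```python
def fup_case(orig_large, orig_count, new_count, orig_size, new_size, minsup):
    """Classify one itemset after insertion under FUP's four cases."""
    if orig_large:
        # Cases 1-2: the original count was recorded, so just add the
        # count from the new transactions; no rescan is needed.
        total = orig_count + new_count
        status = 'large' if total >= (orig_size + new_size) * minsup else 'small'
        return status, total
    if new_count >= new_size * minsup:
        # Case 3: originally small but large in the new transactions;
        # the original count is unknown, so the old database is rescanned.
        return 'rescan', None
    # Case 4: small in both parts, hence small in the updated database.
    return 'small', None

print(fup_case(True, 2, 0, 4, 2, 0.5))     # → ('small', 2)
print(fup_case(False, None, 2, 4, 2, 0.5)) # → ('rescan', None)
```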
20
Case 3 in FUP
  • Originally: small
  • In new transactions: large
  • ⇒ Rescan the original database

Re-scanning original database
21
Negative Border
  • The candidate itemsets without enough support
  • e.g. Negative 1-itemsets


22
Negative Border
  • e.g. Negative 2-itemsets

  • Negative border = all the negative itemsets

23
Negative Border Algorithm
  • Pre-store all candidate itemsets
  • Large itemsets
  • Negative itemsets

24
Negative Border Algorithm
  • Process candidates from new transactions
  • Three cases
  • 1. Originally large itemsets
  • 2. Originally negative itemsets
  • → Calculate the counts directly (e.g. D)
  • 3. Others → Rescan
  • Handles Case 3 in FUP
  • Reduces the rescanning probability
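The negative border can be computed from the large itemsets alone. A sketch, where `all_items` stands in for the item universe of the database:

```python
from itertools import combinations

def negative_border(all_items, large_itemsets):
    """Itemsets that are not large although every proper subset is large."""
    large = set(map(frozenset, large_itemsets))
    max_k = max((len(s) for s in large), default=0)
    # Negative 1-itemsets: items that are not large on their own
    border = {frozenset([i]) for i in all_items} - large
    for k in range(2, max_k + 2):
        for cand in map(frozenset, combinations(sorted(all_items), k)):
            if cand not in large and all(
                    frozenset(sub) in large for sub in combinations(cand, k - 1)):
                border.add(cand)
    return border

large = ['A', 'B', 'C', 'E', 'AC', 'BC', 'BE', 'CE', 'BCE']
print(sorted(''.join(sorted(s)) for s in negative_border('ABCDE', large)))
# → ['AB', 'AE', 'D']
```

Here D is a negative 1-itemset, and AB, AE are candidates whose subsets are all large but which failed the support test themselves.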

25
Pre-Large Itemsets
  • For efficiently handling Case 3 in FUP

26
Pre-Large Itemset
  • Not truly large
  • Acting like a buffer
  • Reducing the movement from large directly to small,
    or vice versa
  • Lying between the lower and upper thresholds

Area of pre-large itemsets
Area of small itemsets
Area of large itemsets
27
Pre-Large Algorithm
  • Record all large and pre-large itemsets in the
    original database
  • Nine cases are generated

28
Results of Nine Cases
29
Safety Bound of Insertions
  • For Case 7
  • Small itemsets in the original database,
  • but large in the new transactions
  • To obtain a safety bound of insertions, t
  • If the number of new insertions > t, then rescan
    the original database
  • Otherwise, do nothing

Large itemsets area
Small itemsets area
Upper threshold
Lower threshold
30
Theoretical Foundation
  • Let
  • Su: upper support threshold
  • Sl: lower support threshold
  • d: transaction number in the original database
  • t: allowed insertion number for not rescanning
    the original database
  • Theorem
  • If t ≤ (Su − Sl) × d / (1 − Su),
  • then no rescan is needed!

31
Example
  • Let
  • Su = 60%
  • Sl = 50%
  • d = 100
  • t = ?

Safety bound: t = ⌊(0.6 − 0.5) × 100 / (1 − 0.6)⌋ = 25
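With the thresholds expressed in percent, the safety bound is exact integer arithmetic. A minimal sketch of the theorem's formula:

```python
def insertion_safety_bound(su_pct, sl_pct, d):
    """Safety bound t = floor((Su - Sl) * d / (1 - Su)); thresholds are
    given in percent so the computation stays in exact integers."""
    return (su_pct - sl_pct) * d // (100 - su_pct)

print(insertion_safety_bound(60, 50, 100))  # → 25
```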
32
Proof
I's count in the updated database
= I's count in the original database
+ I's count in the newly inserted transactions
33
Safety Bound of Lower Threshold
  • To obtain a safety bound on the threshold Sl
  • Given fixed insertions t
  • Derive the corresponding Sl
  • If the new insertion number > t,
  • then rescan the original database
  • and re-adjust Sl
  • Otherwise, do nothing

34
Theoretical Foundation
  • Theorem
  • If t ≤ (Su − Sl) × d / (1 − Su),
  • then no rescan!
  • Corollary
  • If Sl ≤ Su − t × (1 − Su) / d,
  • then no rescan

35
Example
  • Let
  • Su = 60%
  • d = 100
  • t = 10
  • Sl = ?

Safety bound: Sl = 0.6 − 10 × (1 − 0.6) / 100 = 0.56
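Solving the same inequality for Sl gives the corollary's bound. A sketch with thresholds in percent:

```python
def lower_threshold_bound(su_pct, t, d):
    """Lowest safe lower threshold: Sl = Su - t * (1 - Su) / d (in percent)."""
    return su_pct - t * (100 - su_pct) / d

print(lower_threshold_bound(60, 10, 100))  # → 56.0
```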
36
Example
  • Let
  • Su = 60%
  • d = 100
  • t = 10

37
Some Advantages
  • From t ≤ (Su − Sl) × d / (1 − Su):
  • when the database grows larger (d increases),
  • the allowed number of new insertions t
  • also becomes larger

38
Some Advantages
  • From Sl ≤ Su − t × (1 − Su) / d:
  • when the database grows larger (d increases),
  • the allowed lower threshold Sl
  • moves closer to Su

39
Diagram

Re-calculating the updated item counts
≤ safety bound
Originally small itemsets
> safety bound
40
Proposed Algorithm
  • Step 1: record all originally large and pre-large
    itemsets
  • Step 2: calculate the safety bound
  • Step 3: scan the new transactions
  • Step 4: divide the candidates in the new transactions
    into three parts:
    a. originally large itemsets
    b. originally pre-large itemsets
    c. originally small itemsets
  • Step 5: check whether the updated itemsets are
    large, pre-large, or small
  • Step 6: if the number of new transactions ≤ safety bound,
    do nothing;
    otherwise, rescan the original database
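Steps 4 to 6 can be condensed into a per-itemset classification. This is a sketch with illustrative names; counts are recorded only for originally large and pre-large itemsets, which is what makes Step 6's safety-bound check necessary for the originally small ones.

```python
def prelarge_update(status, orig_count, new_count, orig_size, new_size,
                    su_pct, sl_pct):
    """Classify one itemset after insertion under the pre-large scheme.

    status: 'large' | 'prelarge' | 'small' in the original database.
    Thresholds su_pct, sl_pct are percentages.
    """
    total = orig_size + new_size
    if status in ('large', 'prelarge'):
        # Counts are known, so the updated support is computed directly.
        pct = 100 * (orig_count + new_count) / total
        if pct >= su_pct:
            return 'large'
        return 'prelarge' if pct >= sl_pct else 'small'
    # Originally small: safe while the insertions stay within the bound.
    safety_bound = (su_pct - sl_pct) * orig_size // (100 - su_pct)
    return 'no rescan needed' if new_size <= safety_bound else 'rescan'

print(prelarge_update('prelarge', 55, 9, 100, 10, 60, 50))  # → 'prelarge'
print(prelarge_update('small', None, 9, 100, 10, 60, 50))   # → 'no rescan needed'
```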

41
Experiments
42
Experiments
43
Experiments
44
Other Problems
  • Deletion of old records

Original database
{A}, {B}, {C}, {E}; {BC}, {BE}, {CE}; {BCE}
Deletion
45
Other Problems
  • Modification of old records

Original database
{A}, {B}, {C}, {E}; {BC}, {BE}, {CE}; {BCE}
Modification
46
Other Problems
  • Sequential Patterns
  • e.g.
  • Customer 1, time 1000: BCD
  • Customer 2, time 1100: BC
  • Customer 1, time 1500: BCDE
  • Customer 2, time 1700: EF
  • Customer 1: BCD → BCDE
  • Customer 2: BC → EF
  • ⇒ Common pattern: BC → E

47
Summary
  • Pre-large concept
  • Efficient incremental mining algorithms
  • Retain the features of FUP
  • Avoid re-computing large itemsets
  • Focus on the newly inserted transactions
  • Filter the candidate itemsets
  • Become increasingly efficient as the database
    grows

48
Motivation for Online Mining
  • Online Data mining
  • Iterative
  • Interactive
  • Changing requirements
  • e.g. varying thresholds (e.g. minsup)
  • e.g. data changing over time

49
Why Multi-Dimensional Online Mining

SimonBike company
Los Angeles branch
San Francisco branch
New York branch
50
Examples
  • From Los Angeles and San Francisco in the first
    quarters of the latest five years
  • The minimum support increasing from 5% to 10%
  • Popular combinations of products sold in July
    last year

51
Multidimensional Online Mining
Classic mining framework
Multidimensional online mining framework
52
Framework
Multidimensional pattern relation
  • Mining

Underlying database or data warehouse
  • Multidimensional
  • online mining


(Framework diagram: monthly blocks of data, e.g. 2003/1 to 2003/12, from the
Los Angeles and New York branches are mined into the multidimensional pattern
relation; Mining Query 1 and Mining Query 2 are answered from that relation,
and each new block of data is inserted as it arrives.)
53
Framework
  • Assumption
  • Data evolving in a systematic way
  • Inserted or deleted in a block during a time
    interval
  • e.g. at midnight each day
  • e.g. a month
  • Mining Knowledge Warehouse
  • Storing mined information systematically and
    structurally
  • Multidimensional pattern relation (MPR)

54
Multidimensional Pattern Relation
Assume an initial minimum support of 5%

Context (Circumstances)
Content (Mining Information)
55
Multidimensional Online Mining
  • Three-phase Online Association Rule Mining
    (TOARM)
  • Based on MPR
  • Three phases
  • Generation of candidate itemsets
  • Pruning of candidate itemsets
  • Generation of large itemsets

56
Phase 1
  • Generation of the candidate itemsets satisfying a
    mining request
  • e.g. Find large itemsets satisfying minsup = 5.5%
    from data collected in CA from 1998/11 to 1998/12
  • Selecting the tuples satisfying these contexts
57
Phase 1
  • Generating candidates from these matched tuples
  • Lemma: for each candidate itemset x, there exists
    at least one matched tuple in which x's support
    satisfies the query support

Candidates A, B, C, AB
58
Phase 2
  • Pruning of candidate itemsets
  • Calculating upper-bound supports of candidates

⇒ Prune A
59
Phase 2
⇒ Send B to Phase 3
60
Phase 2
⇒ Keep C in the large itemsets
⇒ Remove AB
61
Phase 3
  • Generation of Large Itemsets
  • Rescanning the blocks of candidate B for tuples
    3 and 5
  • Obtaining their physical supports
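Phases 2 and 3 rest on upper- and lower-bound counts per candidate. A sketch, where the tuple layout `(stored_min_count, count)` is a simplification of the MPR content:

```python
def classify_candidate(tuples, query_min_count):
    """tuples: per matched block, (stored_min_count, count), where count is
    the itemset's recorded count, or None if it fell below that block's
    threshold; in that case it appeared at most stored_min_count - 1 times."""
    lower = sum(c for _, c in tuples if c is not None)
    upper = sum(c if c is not None else m - 1 for m, c in tuples)
    if upper < query_min_count:
        return 'pruned'        # Phase 2: can never reach the query threshold
    if lower >= query_min_count:
        return 'large'         # Phase 2: already large without any rescan
    return 'rescan'            # Phase 3: rescan the blocks with unknown counts

print(classify_candidate([(5, 3), (5, None)], 11))   # → 'pruned'
print(classify_candidate([(5, 8), (5, 6)], 11))      # → 'large'
print(classify_candidate([(5, 10), (5, None)], 11))  # → 'rescan'
```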

62
Multi-Dimensional Online Mining
  • Helping on-line decision support
  • From only the portion of the data of interest
  • From different aspects
  • e.g. time, location, threshold

63
Extended MPR (EMPR)
  • Keeping additional negative-border information

e.g. Initial minimum support = 5%
64
Negative-Border Online Mining (NOM)
  • Based on EMPR
  • TOARM → NOM
  • Appearing count
  • Frequent patterns
  • Negative patterns
  • Not-appearing count (upper bound for an itemset x
    absent from a tuple)
  • TOARM → n × s − 1
  • NOM → min(n × s − 1, min over x' ⊂ x of count(x'))
65
Example
  • NOM → min(n × s − 1, min over x' ⊂ x of count(x'))
  • Example

AC is a candidate:
  tuple 1: n1 = 7
  tuple 2: n2 = min(5, 2) = 2
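NOM's tighter not-appearing bound can be sketched as follows. This reflects my reading of the slide's formula and the names are illustrative; `min_count` stands for the tuple's minimum count n × s.

```python
def nom_upper_bound(min_count, subset_counts):
    """Upper bound on the count of an itemset absent from a tuple:
    TOARM uses min_count - 1; NOM additionally caps it by the stored
    counts of the itemset's subsets, since a superset can never occur
    more often than any of its subsets."""
    return min([min_count - 1] + list(subset_counts))

# Tuple 2 of the AC example: min(5, 2) = 2
print(nom_upper_bound(6, [5, 2]))  # → 2
```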
66
Lattice-based NOM (LNOM)
  • Implementing NOM more efficiently
  • Using the lattice data structure
  • Using hashing techniques
  • spanning-tree-count-calculating (STCC) algorithm

67
Experiments
  • Implemented in Java
  • On a workstation
  • with dual Xeon (2.8 GHz) processors
  • and 2 GB main memory,
  • running RedHat 9.0
  • Datasets
  • Several synthetic datasets (IBM data generator)
  • A real-world dataset, BMS-POS (KDD Cup 2000)

68
Notation
69
Data Sets
70
Experiment TOARM
  • Performance comparison synthetic datasets

71
Experiment TOARM
  • Performance comparison BMS-POS dataset

72
Experiment TOARM
  • Scalability comparison

73
Experiment TOARM vs. NOM
  • Performance comparison synthetic datasets

74
Experiment TOARM vs. NOM
  • Execution time in Phase 3:
  • NOM < TOARM
  • Execution time in Phases 1 and 2:
  • NOM > TOARM
  • NOM outperforms TOARM overall with
  • a small item set and
  • a large data size

75
Experiment NOM vs. LNOM
  • Performance comparison synthetic datasets

76
Experiment NOM vs. LNOM
  • Performance comparison BMS-POS dataset

77
Summary
  • Multi-dimensional online mining
  • Effectively utilizing patterns previously mined
  • Knowledge warehouse
  • Multidimensional pattern relation
  • Extended multidimensional pattern relation (EMPR)
  • Mining approaches
  • TOARM
  • NOM
  • LNOM
  • Experimental results
  • Showing good performance

78
Thank You