Title: Incremental and On-Line Data Mining
1. Incremental and On-Line Data Mining
- Prof. Tzung-Pei Hong
- Department of Electrical Engineering, National University of Kaohsiung
2. Outline
- Introduction
- Apriori Mining Algorithm
- Incremental Mining
- FUP algorithm
- Negative border algorithm
- Pre-large itemset algorithm
3. Outline
- Multi-dimensional Online Mining
- Knowledge Warehouse
- Three-phase Online Association Rule Mining (TOARM)
- Negative-Border Online Mining (NOM)
- Lattice-based NOM (LNOM)
- Conclusion
4. Why Data Mining?
How should goods be arranged in a supermarket?
5. Association Rules
IF bread is bought THEN milk is bought
6. The Role of Data Mining
(Figure: transaction data → preprocessed data → data mining → useful patterns → knowledge and strategy)
7. Different Kinds of Knowledge
- Association rules
- Generalized association rules
- Sequential patterns
- Quantitative association rules
- Classification rules
- Clustering rules
- etc.
8. Association Rules
IF bread is bought THEN milk is bought
9. Apriori Algorithm
- Proposed by Agrawal et al.
- Step 1: Define minsup and minconf
  - e.g. minsup = 50%, minconf = 50%
- Step 2: Find large itemsets
- Step 3: Generate association rules
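The level-wise search in Step 2 can be sketched in Python (an illustrative sketch, not the authors' implementation; itemsets are represented as frozensets, and minsup is given as a fraction of the transactions):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Find all large (frequent) itemsets with support >= minsup.

    transactions: list of sets of items; minsup: fraction in [0, 1].
    Returns {itemset: count} for every large itemset.
    """
    min_count = minsup * len(transactions)

    # L1: count single items and keep the large 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    large = {s for s, c in counts.items() if c >= min_count}
    all_large = {s: counts[s] for s in large}

    k = 2
    while large:
        # Candidate generation: join two large (k-1)-itemsets, then
        # prune candidates that have a small (k-1)-subset
        candidates = set()
        for a in large:
            for b in large:
                u = a | b
                if len(u) == k and all(frozenset(sub) in large
                                       for sub in combinations(u, k - 1)):
                    candidates.add(u)
        # One database scan per level to count the candidates
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        large = {c for c, n in counts.items() if n >= min_count}
        all_large.update((c, counts[c]) for c in large)
        k += 1
    return all_large
```

With the classic four-transaction example used in the following slides (minsup = 50%), this yields the large itemsets A, B, C, E, the large 2-itemsets, and BCE.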
10. Example
Large 1-itemsets L1, obtained by scanning the database:

  Itemset   Sup.
  A         2
  B         3
  C         3
  E         3

(The database is scanned again at each level to obtain L2, L3, …)
11. Example
12. Maintenance of Association Rules
- Insertion of new transactions
- Original database + new transactions → updated mining results?
13. Two Problems
- Problem 1: Are originally large itemsets still large?
- e.g. minsup = 50%, original database: 4 records, new transactions: 2 records
- A's count in the original database: 2; A's count in the new transactions: 0
- Updated count 2 < (4 + 2) × 50% = 3 (minimum count), so A is no longer large
14. Two Problems
- Problem 2: Are originally small itemsets still small?
- e.g. minsup = 50%, original database: 4 records, new transactions: 2 records
- D's count in the original database: 1; D's count in the new transactions: 2
- Updated count 3 ≥ (4 + 2) × 50% = 3 (minimum count), so D becomes large
15. Maintenance
- Intuitive approach: re-mine the whole updated database after each insertion
- Drawbacks:
  1. Wastes previously discovered information
  2. Spends much computation time
16. Related Research
- FUP algorithm
- Negative border algorithms
- Pre-large itemset algorithm
17. FUP Algorithm
- Cheung et al., 1995
- For efficient incremental data mining
- Step 1: Record all large itemsets
18. Four Cases in FUP
19. FUP Algorithm
- Step 2: Process the new transactions
- For originally large itemsets: re-calculate the updated counts directly
- For originally small itemsets that are also small in the new transactions: they remain small
- Only the remaining case may require rescanning, which reduces rescanning of the database
20. Case 3 in FUP
- Originally small, but large in the new transactions
- Must re-scan the original database to obtain the exact updated count
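A minimal sketch of this case analysis, restricted to 1-itemsets for brevity (the real FUP iterates level by level over itemsets of every size; the data layout here is an assumption):

```python
def fup_update(old_counts, new_transactions, minsup, d):
    """One round of FUP-style maintenance for 1-itemsets (sketch).

    old_counts: {itemset: count} for the originally large itemsets
    over the d original transactions. Returns the itemsets known to be
    large in the updated database, plus the itemsets that require a
    rescan of the original database (Case 3).
    """
    t = len(new_transactions)
    min_count = minsup * (d + t)

    # Count every item appearing in the new transactions
    new_counts = {}
    for trans in new_transactions:
        for item in trans:
            s = frozenset([item])
            new_counts[s] = new_counts.get(s, 0) + 1

    updated_large, need_rescan = set(), set()
    # Cases 1 and 2: originally large -> just add the new counts
    for s, c in old_counts.items():
        if c + new_counts.get(s, 0) >= min_count:
            updated_large.add(s)
    # Cases 3 and 4: originally small itemsets
    for s, c in new_counts.items():
        if s in old_counts:
            continue
        if c >= minsup * t:      # large in the new part: Case 3
            need_rescan.add(s)   # exact count needs a database rescan
        # otherwise small in both parts (Case 4): stays small
    return updated_large, need_rescan
```

Running this on the example of slides 13 and 14 (4 original records, 2 new ones, minsup = 50%) drops A from the large itemsets and flags D for a rescan.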
21. Negative Border
- The candidate itemsets without enough support
- e.g. the negative 1-itemsets are the candidate 1-itemsets whose counts fall below the minimum count
22. Negative Border
- Negative border: the set of all negative itemsets
23. Negative Border Algorithm
- Pre-store all candidate itemsets:
  - Large itemsets
  - Negative itemsets
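A sketch of how the negative border can be computed from the large itemsets (illustrative only; the join-and-prune step mirrors Apriori candidate generation):

```python
from itertools import combinations

def negative_border(large_sets, all_items):
    """Negative border (sketch): candidate itemsets that are not large
    but whose every proper subset is large."""
    large = set(large_sets)
    # Every non-large single item belongs to the negative border
    border = {frozenset([i]) for i in all_items} - large

    by_size = {}
    for s in large:
        by_size.setdefault(len(s), set()).add(s)
    for k, lk in sorted(by_size.items()):
        # Join two large k-itemsets; keep non-large results whose
        # k-subsets are all large (the Apriori prune condition)
        for a in lk:
            for b in lk:
                u = a | b
                if (len(u) == k + 1 and u not in large and
                        all(frozenset(c) in large
                            for c in combinations(u, k))):
                    border.add(u)
    return border
```

For the running example (large itemsets A, B, C, E, AC, BC, BE, CE, BCE over items A–E), the negative border is D, AB, and AE.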
24. Negative Border Algorithm
- Process the candidates appearing in the new transactions
- Three cases:
  - Originally large itemset: update the count directly
  - Originally negative itemset: update the count directly (e.g. itemset D)
  - Others: rescan the original database
- Handles Case 3 in FUP and reduces the probability of rescanning
25. Pre-Large Itemsets
- For efficiently handling Case 3 in FUP
26. Pre-Large Itemset
- Not truly large, but acting as a buffer
- Reduces the movement of itemsets directly from large to small, or vice versa
- Its support lies between the lower and upper thresholds
- (Figure: the support range is divided into the areas of large, pre-large, and small itemsets)
27. Pre-Large Algorithm
- Record all large and pre-large itemsets in the original database
- Nine cases are generated (large / pre-large / small in the original database × large / pre-large / small in the new transactions)
28. Results of Nine Cases
29. Safety Bound of Insertions
- For Case 7: small in the original database, but large in the new transactions
- Derive a safety bound t on the number of insertions
- If the number of new insertions > t, rescan the original database
- Otherwise, do nothing
(Figure: large and small itemset areas separated by the upper and lower thresholds)
30. Theoretical Foundation
- Let
  - Su: upper support threshold
  - Sl: lower support threshold
  - d: number of transactions in the original database
  - t: number of insertions allowed without rescanning the original database
- Theorem: if t is within the derived bound, then no rescan is needed
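The formula itself is lost in this copy of the slide; in the pre-large itemset literature (Hong et al.) the safety bound is usually stated as follows (a reconstruction, consistent with the example t = 25 on the next slide for, e.g., d = 100, Su = 60% and Sl = 50%, assumed values):

```latex
t \;\le\; f \;=\; \left\lfloor \frac{(S_u - S_l)\, d}{1 - S_u} \right\rfloor
```

Intuitively, an itemset small in the original database has count below $S_l\,d$; even if it appears in all $t$ new transactions, $S_l\,d + t < S_u(d + t)$ holds whenever $t \le f$, so it cannot reach the upper threshold without a rescan being triggered first.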
31. Example
- Safety bound: t = 25
32. Proof
- I's count in the updated database
- I's count in the original database
- I's count in the newly inserted transactions
33. Safety Bound of Lower Threshold
- Derive a safety bound on the lower threshold Sl
- Given a fixed number of insertions t, derive the corresponding Sl
- If the number of new insertions > t, rescan the original database and re-adjust Sl
- Otherwise, do nothing
34. Theoretical Foundation
- Theorem: if Sl is chosen within the derived bound, then no rescan is needed
- Corollary: a simplified form of the same condition also guarantees no rescan
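The condition is again lost here; solving the insertion bound $f \ge t$ for $S_l$ gives the form below (a reconstruction, consistent with the example Sl = 0.56 on the next slide for, e.g., d = 100, Su = 60% and t = 10, assumed values):

```latex
S_l \;\le\; S_u - \frac{t\,(1 - S_u)}{d}
```

That is, to tolerate t insertions without rescanning, the lower threshold must be set at least this far below the upper threshold.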
35. Example
- Safety bound: Sl = 0.56
36. Example
37. Some Advantages
- From the safety-bound formula: when the database grows larger, the allowed number of new insertions also becomes larger
38. Some Advantages
- From the threshold bound: when the database grows larger, the allowed lower threshold Sl gets closer to Su
39. Diagram
(Flowchart: re-calculate the updated itemset counts; if the number of new transactions ≤ the safety bound, handle originally small itemsets without rescanning; if > the safety bound, rescan the original database)
40. Proposed Algorithm
- Step 1: Record all originally large and pre-large itemsets
- Step 2: Calculate the safety bound
- Step 3: Scan the new transactions
- Step 4: Divide the candidates in the new transactions into three parts:
  - a. originally large itemsets
  - b. originally pre-large itemsets
  - c. small itemsets
- Step 5: Check whether the updated itemsets are large, pre-large, or small
- Step 6: If the number of new transactions ≤ the safety bound, do nothing; otherwise, rescan the original database
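The steps above can be sketched for 1-itemsets (a hedged sketch; the safety-bound formula used is the one commonly stated in the pre-large literature, and the data layout is an assumption):

```python
import math

def prelarge_update(stored, d, new_transactions, su, sl):
    """One round of pre-large maintenance for 1-itemsets (sketch).

    stored: {itemset: count} over the d original transactions for every
    itemset that was large or pre-large there; su / sl: upper and lower
    support thresholds. Returns the updated large and pre-large sets
    and a flag saying whether the original database must be rescanned.
    """
    t = len(new_transactions)
    # Safety bound f = floor((su - sl) * d / (1 - su)); within it, an
    # originally small itemset cannot silently become large (Case 7)
    f = math.floor((su - sl) * d / (1 - su))

    new_counts = {}
    for trans in new_transactions:
        for item in trans:
            s = frozenset([item])
            new_counts[s] = new_counts.get(s, 0) + 1

    large, pre_large = set(), set()
    for s, count in stored.items():
        total = count + new_counts.get(s, 0)
        if total >= su * (d + t):
            large.add(s)
        elif total >= sl * (d + t):
            pre_large.add(s)
    return large, pre_large, t > f
```

Note how an itemset such as D, pre-large in the original database, can be promoted to large using only the stored counts, which is exactly the buffering effect described on slide 26.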
41. Experiments
42. Experiments
43. Experiments
44. Other Problems
- Deletion of old records
- The original large itemsets (A, B, C, E; BC, BE, CE; BCE) may change after deletion
45. Other Problems
- Modification of old records
- The original large itemsets (A, B, C, E; BC, BE, CE; BCE) may change after modification
46. Other Problems
- Sequential patterns
- e.g.
  - Customer 1, time 1000: BCD
  - Customer 2, time 1100: BC
  - Customer 1, time 1500: BCDE
  - Customer 2, time 1700: EF
- Customer 1's sequence: BCD → BCDE
- Customer 2's sequence: BC → EF
- ⇒ common sequential pattern: BC → E
47. Summary
- Pre-large concept
- Efficient incremental mining algorithms
  - Retain the features of FUP
  - Avoid re-computing large itemsets
  - Focus on the newly inserted transactions
  - Filter the candidate itemsets
- Become increasingly efficient as the database grows
48. Motivation for Online Mining
- Online data mining is
  - Iterative
  - Interactive
- Requirements change
  - e.g. varying thresholds (e.g. minsup)
  - e.g. data changing over time
49. Why Multi-Dimensional Online Mining
- e.g. the SimonBike company, with branches in Los Angeles, San Francisco, and New York
50. Examples
- Itemsets from Los Angeles and San Francisco in the first quarters of the latest five years, with the minimum support increasing from 5% to 10%
- Popular combinations of products sold in July last year
51. Multidimensional Online Mining
- Classic mining framework vs. multidimensional online mining framework
52. Framework
(Figure: new blocks of data, e.g. monthly data from the Los Angeles and New York branches from 2003/1 to 2003/12, are inserted into the underlying database or data warehouse; their mined patterns are stored in the multidimensional pattern relation, over which mining queries 1, 2, … are answered by multidimensional online mining)
53. Framework
- Assumption: data evolve in a systematic way, inserted or deleted in blocks per time interval
  - e.g. at midnight each day
  - e.g. once a month
- Mining knowledge warehouse
  - Stores mined information systematically and structurally
  - Multidimensional pattern relation (MPR)
54. Multidimensional Pattern Relation
- Assume an initial minimum support of 5%
- Context attributes (circumstances) and content attributes (mining information)
55. Multidimensional Online Mining
- Three-phase Online Association Rule Mining (TOARM), based on the MPR
- Three phases:
  1. Generation of candidate itemsets
  2. Pruning of candidate itemsets
  3. Generation of large itemsets
56. Phase 1
- Generate the candidate itemsets satisfying a mining request
- e.g. find large itemsets with minsup = 5.5% from data collected in CA from 1998/11 to 1998/12
- Select the tuples whose context attributes match the request
57. Phase 1
- Generate candidates from the matched tuples
- Lemma: for each candidate itemset x, there exists at least one matched tuple in which x's support satisfies the query support
- Candidates: A, B, C, AB
58. Phase 2
- Pruning of candidate itemsets
- Calculate the upper-bound supports of the candidates
- ⇒ A is pruned
59. Phase 2
- ⇒ Send B to Phase 3
60. Phase 2
- ⇒ Keep C among the large itemsets
- ⇒ Remove AB (upper-bound support 0.053)
61. Phase 3
- Generation of large itemsets
- Rescan the data blocks of candidate B corresponding to tuples 3 and 5
- Obtain its physical (exact) support
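The pruning in Phases 2 and 3 can be sketched as follows (a hedged sketch; the tuple layout and the ceil(n·s) − 1 upper bound for itemsets not recorded in a block are assumptions based on these slides):

```python
import math

def upper_bound_count(candidate, tuples, query_minsup):
    """Phase-2 style pruning of one candidate itemset (sketch).

    tuples: matched tuples of a pattern relation, each a dict with 'n'
    (transactions in the block), 's' (minsup used when the block was
    mined) and 'counts' ({itemset: count} stored for it). If the
    candidate is absent from a block, its count there is at most
    ceil(n * s) - 1. Returns (upper_bound, blocks_to_rescan); the
    upper bound is None when the candidate can already be pruned.
    """
    total_n = sum(tp['n'] for tp in tuples)
    ub = 0
    unknown = []
    for i, tp in enumerate(tuples):
        if candidate in tp['counts']:
            ub += tp['counts'][candidate]            # exact stored count
        else:
            ub += math.ceil(tp['n'] * tp['s']) - 1   # upper bound only
            unknown.append(i)
    if ub < query_minsup * total_n:
        return None, []        # Phase 2: prune, cannot be large
    if not unknown:
        return ub, []          # exact count known: decide immediately
    return ub, unknown         # Phase 3: rescan these blocks
```

This mirrors the three outcomes on slides 58–60: a candidate is pruned outright, decided from stored counts alone, or passed to Phase 3 with the list of blocks to rescan.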
62. Multi-Dimensional Online Mining
- Helps on-line decision support
- Mines only the interesting portion of the data
- Views the data from different aspects, e.g. time, location, threshold
63. Extended MPR (EMPR)
- Additionally keeps negative-border information
- e.g. with an initial minimum support of 5%
64. Negative-Border Online Mining (NOM)
- Based on the EMPR; extends TOARM
- Appearing count: the pattern is stored in the tuple (as a frequent or negative-border pattern), so its count is known
- Not-appearing count must be bounded:
  - TOARM: at most n·s − 1
  - NOM: at most min(n·s − 1, n · min s(x′)), where x′ ranges over the proper subsets of x
65. Example
- NOM bound: min(n·s − 1, n · min s(x′)), x′ a proper subset of x
- Example: candidate AC
  - Tuple 1: n1 = 7
  - Tuple 2: n2 = min(5, 2) = 2
66. Lattice-based NOM (LNOM)
- Implements NOM more efficiently
- Uses a lattice data structure
- Uses hashing techniques
- Spanning-tree-count-calculating (STCC) algorithm
67. Experiments
- Implemented in Java
- On a workstation with dual Xeon (2.8 GHz) processors and 2 GB of main memory, running RedHat 9.0
- Datasets:
  - Several synthetic datasets (IBM data generator)
  - A real-world dataset, BMS-POS (KDD Cup 2000)
68. Notation
69. Data Sets
70. Experiment: TOARM
- Performance comparison on the synthetic datasets
71. Experiment: TOARM
- Performance comparison on the BMS-POS dataset
72. Experiment: TOARM
73. Experiment: TOARM vs. NOM
- Performance comparison on the synthetic datasets
74. Experiment: TOARM vs. NOM
- Execution time in Phase 3: NOM < TOARM
- Execution time in Phases 1 and 2: NOM > TOARM
- NOM outperforms TOARM when the set of items is small and the data size is large
75. Experiment: NOM vs. LNOM
- Performance comparison on the synthetic datasets
76. Experiment: NOM vs. LNOM
- Performance comparison on the BMS-POS dataset
77. Summary
- Multi-dimensional online mining
  - Effectively utilizes previously mined patterns
- Knowledge warehouse
  - Multidimensional pattern relation (MPR)
  - Extended multidimensional pattern relation (EMPR)
- Mining approaches
  - TOARM
  - NOM
  - LNOM
- Experimental results show good performance
78. Thank You