Title: Incremental and On-Line Data Mining
1. Incremental and On-Line Data Mining
- Prof. Tzung-Pei Hong
- Department of Electrical Engineering, National University of Kaohsiung
2. Outline
- Introduction
- Apriori Mining Algorithm
- Incremental Mining
- FUP algorithm
- Negative border algorithm
- Pre-large itemset algorithm
3. Outline
- Multi-dimensional Online Mining
- Knowledge Warehouse
- Three-phase Online Association Rule Mining (TOARM)
- Negative-Border Online Mining (NOM)
- Lattice-based NOM (LNOM)
- Conclusion
4. Why Data Mining?
How should goods be arranged in a supermarket?
5. Association Rules
IF bread is bought THEN milk is bought
6. The Role of Data Mining
(Figure: transaction data → preprocessed data → data mining → useful patterns → knowledge and strategy)
7. Different Kinds of Knowledge
- Association rules
- Generalized association rules
- Sequential patterns
- Quantitative association rules
- Classification rules
- Clustering rules
- etc.
8. Association Rules
IF bread is bought THEN milk is bought
9. Apriori Algorithm
- Proposed by Agrawal et al.
- Step 1: Define minsup and minconf
  - e.g. minsup = 50%, minconf = 50%
- Step 2: Find large itemsets
- Step 3: Generate association rules
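The level-wise search in Step 2 can be sketched in Python (an illustrative sketch, not the authors' implementation; itemsets are represented as frozensets, and minsup is given as a fraction of the transactions):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Find all large (frequent) itemsets with support >= minsup.

    transactions: list of sets of items; minsup: fraction in [0, 1].
    Returns {itemset: count} for every large itemset.
    """
    min_count = minsup * len(transactions)

    # L1: count single items and keep the large 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    large = {s for s, c in counts.items() if c >= min_count}
    all_large = {s: counts[s] for s in large}

    k = 2
    while large:
        # Candidate generation: join two large (k-1)-itemsets, then
        # prune candidates that have a small (k-1)-subset
        candidates = set()
        for a in large:
            for b in large:
                u = a | b
                if len(u) == k and all(frozenset(sub) in large
                                       for sub in combinations(u, k - 1)):
                    candidates.add(u)
        # One database scan per level to count the candidates
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        large = {c for c, n in counts.items() if n >= min_count}
        all_large.update((c, counts[c]) for c in large)
        k += 1
    return all_large
```

With the classic four-transaction example used in the following slides (minsup = 50%), this yields the large itemsets A, B, C, E, the large 2-itemsets, and BCE.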
10. Example
Large 1-itemsets L1, obtained by scanning the database:

  Itemset   Sup.
  A         2
  B         3
  C         3
  E         3

(The database is scanned again at each level to obtain L2, L3, …)
11. Example
12. Maintenance of Association Rules
- Insertion of new transactions
- Original database + new transactions → updated mining results?
13. Two Problems
- Problem 1: Are originally large itemsets still large?
- e.g. minsup = 50%, original database: 4 records, new transactions: 2 records
- A's count in the original database: 2; A's count in the new transactions: 0
- Updated count 2 < (4 + 2) × 50% = 3 (minimum count), so A is no longer large
14. Two Problems
- Problem 2: Are originally small itemsets still small?
- e.g. minsup = 50%, original database: 4 records, new transactions: 2 records
- D's count in the original database: 1; D's count in the new transactions: 2
- Updated count 3 ≥ (4 + 2) × 50% = 3 (minimum count), so D becomes large
15. Maintenance
- Intuitive approach: re-mine the whole updated database after each insertion
- Drawbacks:
  1. Wastes previously discovered information
  2. Spends much computation time
16. Related Research
- FUP algorithm
- Negative border algorithms
- Pre-large itemset algorithm
17. FUP Algorithm
- Cheung et al., 1995
- For efficient incremental data mining
- Step 1: Record all large itemsets
18. Four Cases in FUP
19. FUP Algorithm
- Step 2: Process the new transactions
- For originally large itemsets: re-calculate the updated counts directly
- For originally small itemsets that are also small in the new transactions: they remain small
- Only the remaining case may require rescanning, which reduces rescanning of the database
20. Case 3 in FUP
- Originally small, but large in the new transactions
- Must re-scan the original database to obtain the exact updated count
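A minimal sketch of this case analysis, restricted to 1-itemsets for brevity (the real FUP iterates level by level over itemsets of every size; the data layout here is an assumption):

```python
def fup_update(old_counts, new_transactions, minsup, d):
    """One round of FUP-style maintenance for 1-itemsets (sketch).

    old_counts: {itemset: count} for the originally large itemsets
    over the d original transactions. Returns the itemsets known to be
    large in the updated database, plus the itemsets that require a
    rescan of the original database (Case 3).
    """
    t = len(new_transactions)
    min_count = minsup * (d + t)

    # Count every item appearing in the new transactions
    new_counts = {}
    for trans in new_transactions:
        for item in trans:
            s = frozenset([item])
            new_counts[s] = new_counts.get(s, 0) + 1

    updated_large, need_rescan = set(), set()
    # Cases 1 and 2: originally large -> just add the new counts
    for s, c in old_counts.items():
        if c + new_counts.get(s, 0) >= min_count:
            updated_large.add(s)
    # Cases 3 and 4: originally small itemsets
    for s, c in new_counts.items():
        if s in old_counts:
            continue
        if c >= minsup * t:      # large in the new part: Case 3
            need_rescan.add(s)   # exact count needs a database rescan
        # otherwise small in both parts (Case 4): stays small
    return updated_large, need_rescan
```

Running this on the example of slides 13 and 14 (4 original records, 2 new ones, minsup = 50%) drops A from the large itemsets and flags D for a rescan.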
21. Negative Border
- The candidate itemsets without enough support
- e.g. the negative 1-itemsets are the candidate 1-itemsets whose counts fall below the minimum count
22. Negative Border
- Negative border: the set of all negative itemsets
23. Negative Border Algorithm
- Pre-store all candidate itemsets:
  - Large itemsets
  - Negative itemsets
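A sketch of how the negative border can be computed from the large itemsets (illustrative only; the join-and-prune step mirrors Apriori candidate generation):

```python
from itertools import combinations

def negative_border(large_sets, all_items):
    """Negative border (sketch): candidate itemsets that are not large
    but whose every proper subset is large."""
    large = set(large_sets)
    # Every non-large single item belongs to the negative border
    border = {frozenset([i]) for i in all_items} - large

    by_size = {}
    for s in large:
        by_size.setdefault(len(s), set()).add(s)
    for k, lk in sorted(by_size.items()):
        # Join two large k-itemsets; keep non-large results whose
        # k-subsets are all large (the Apriori prune condition)
        for a in lk:
            for b in lk:
                u = a | b
                if (len(u) == k + 1 and u not in large and
                        all(frozenset(c) in large
                            for c in combinations(u, k))):
                    border.add(u)
    return border
```

For the running example (large itemsets A, B, C, E, AC, BC, BE, CE, BCE over items A–E), the negative border is D, AB, and AE.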
24. Negative Border Algorithm
- Process the candidates appearing in the new transactions
- Three cases:
  - Originally large itemset: update the count directly
  - Originally negative itemset: update the count directly (e.g. itemset D)
  - Others: rescan the original database
- Handles Case 3 in FUP and reduces the probability of rescanning
25. Pre-Large Itemsets
- For efficiently handling Case 3 in FUP
26. Pre-Large Itemset
- Not truly large, but acting as a buffer
- Reduces the movement of itemsets directly from large to small, or vice versa
- Its support lies between the lower and upper thresholds
- (Figure: the support range is divided into the areas of large, pre-large, and small itemsets)
27. Pre-Large Algorithm
- Record all large and pre-large itemsets in the original database
- Nine cases are generated (large / pre-large / small in the original database × large / pre-large / small in the new transactions)
28. Results of Nine Cases
29. Safety Bound of Insertions
- For Case 7: small in the original database, but large in the new transactions
- Derive a safety bound t on the number of insertions
- If the number of new insertions > t, rescan the original database
- Otherwise, do nothing
(Figure: large and small itemset areas separated by the upper and lower thresholds)
30. Theoretical Foundation
- Let
  - Su: upper support threshold
  - Sl: lower support threshold
  - d: number of transactions in the original database
  - t: number of insertions allowed without rescanning the original database
- Theorem: if t is within the derived bound, then no rescan is needed
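The formula itself is lost in this copy of the slide; in the pre-large itemset literature (Hong et al.) the safety bound is usually stated as follows (a reconstruction, consistent with the example t = 25 on the next slide for, e.g., d = 100, Su = 60% and Sl = 50%, assumed values):

```latex
t \;\le\; f \;=\; \left\lfloor \frac{(S_u - S_l)\, d}{1 - S_u} \right\rfloor
```

Intuitively, an itemset small in the original database has count below $S_l\,d$; even if it appears in all $t$ new transactions, $S_l\,d + t < S_u(d + t)$ holds whenever $t \le f$, so it cannot reach the upper threshold without a rescan being triggered first.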
31. Example
- Safety bound: t = 25
32. Proof
- I's count in the updated database
- I's count in the original database
- I's count in the newly inserted transactions
33. Safety Bound of Lower Threshold
- Derive a safety bound on the lower threshold Sl
- Given a fixed number of insertions t, derive the corresponding Sl
- If the number of new insertions > t, rescan the original database and re-adjust Sl
- Otherwise, do nothing
34. Theoretical Foundation
- Theorem: if Sl is chosen within the derived bound, then no rescan is needed
- Corollary: a simplified form of the same condition also guarantees no rescan
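The condition is again lost here; solving the insertion bound $f \ge t$ for $S_l$ gives the form below (a reconstruction, consistent with the example Sl = 0.56 on the next slide for, e.g., d = 100, Su = 60% and t = 10, assumed values):

```latex
S_l \;\le\; S_u - \frac{t\,(1 - S_u)}{d}
```

That is, to tolerate t insertions without rescanning, the lower threshold must be set at least this far below the upper threshold.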
35. Example
- Safety bound: Sl = 0.56
36. Example
37. Some Advantages
- From the safety-bound formula: when the database grows larger, the allowed number of new insertions also becomes larger
38. Some Advantages
- From the threshold bound: when the database grows larger, the allowed lower threshold Sl gets closer to Su
39. Diagram
(Flowchart: re-calculate the updated itemset counts; if the number of new transactions ≤ the safety bound, handle originally small itemsets without rescanning; if > the safety bound, rescan the original database)
40. Proposed Algorithm
- Step 1: Record all originally large and pre-large itemsets
- Step 2: Calculate the safety bound
- Step 3: Scan the new transactions
- Step 4: Divide the candidates in the new transactions into three parts:
  - a. originally large itemsets
  - b. originally pre-large itemsets
  - c. small itemsets
- Step 5: Check whether the updated itemsets are large, pre-large, or small
- Step 6: If the number of new transactions ≤ the safety bound, do nothing; otherwise, rescan the original database
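The steps above can be sketched for 1-itemsets (a hedged sketch; the safety-bound formula used is the one commonly stated in the pre-large literature, and the data layout is an assumption):

```python
import math

def prelarge_update(stored, d, new_transactions, su, sl):
    """One round of pre-large maintenance for 1-itemsets (sketch).

    stored: {itemset: count} over the d original transactions for every
    itemset that was large or pre-large there; su / sl: upper and lower
    support thresholds. Returns the updated large and pre-large sets
    and a flag saying whether the original database must be rescanned.
    """
    t = len(new_transactions)
    # Safety bound f = floor((su - sl) * d / (1 - su)); within it, an
    # originally small itemset cannot silently become large (Case 7)
    f = math.floor((su - sl) * d / (1 - su))

    new_counts = {}
    for trans in new_transactions:
        for item in trans:
            s = frozenset([item])
            new_counts[s] = new_counts.get(s, 0) + 1

    large, pre_large = set(), set()
    for s, count in stored.items():
        total = count + new_counts.get(s, 0)
        if total >= su * (d + t):
            large.add(s)
        elif total >= sl * (d + t):
            pre_large.add(s)
    return large, pre_large, t > f
```

Note how an itemset such as D, pre-large in the original database, can be promoted to large using only the stored counts, which is exactly the buffering effect described on slide 26.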
41. Experiments
42. Experiments
43. Experiments
44. Other Problems
- Deletion of old records
- The original large itemsets (A, B, C, E; BC, BE, CE; BCE) may change after deletion
45. Other Problems
- Modification of old records
- The original large itemsets (A, B, C, E; BC, BE, CE; BCE) may change after modification
46. Other Problems
- Sequential patterns
- e.g.
  - Customer 1, time 1000: BCD
  - Customer 2, time 1100: BC
  - Customer 1, time 1500: BCDE
  - Customer 2, time 1700: EF
- Customer 1's sequence: BCD → BCDE
- Customer 2's sequence: BC → EF
- ⇒ common sequential pattern: BC → E
47. Summary
- Pre-large concept
- Efficient incremental mining algorithms
  - Retain the features of FUP
  - Avoid re-computing large itemsets
  - Focus on the newly inserted transactions
  - Filter the candidate itemsets
- Become increasingly efficient as the database grows
48. Motivation for Online Mining
- Online data mining is
  - Iterative
  - Interactive
- Requirements change
  - e.g. varying thresholds (e.g. minsup)
  - e.g. data changing over time
49. Why Multi-Dimensional Online Mining
- e.g. the SimonBike company, with branches in Los Angeles, San Francisco, and New York
50. Examples
- Itemsets from Los Angeles and San Francisco in the first quarters of the latest five years, with the minimum support increasing from 5% to 10%
- Popular combinations of products sold in July last year
51. Multidimensional Online Mining
- Classic mining framework vs. multidimensional online mining framework
52. Framework
(Figure: new blocks of data, e.g. monthly data from the Los Angeles and New York branches from 2003/1 to 2003/12, are inserted into the underlying database or data warehouse; their mined patterns are stored in the multidimensional pattern relation, over which mining queries 1, 2, … are answered by multidimensional online mining)
53. Framework
- Assumption: data evolve in a systematic way, inserted or deleted in blocks per time interval
  - e.g. at midnight each day
  - e.g. once a month
- Mining knowledge warehouse
  - Stores mined information systematically and structurally
  - Multidimensional pattern relation (MPR)
54. Multidimensional Pattern Relation
- Assume an initial minimum support of 5%
- Context attributes (circumstances) and content attributes (mining information)
55. Multidimensional Online Mining
- Three-phase Online Association Rule Mining (TOARM), based on the MPR
- Three phases:
  1. Generation of candidate itemsets
  2. Pruning of candidate itemsets
  3. Generation of large itemsets
56. Phase 1
- Generate the candidate itemsets satisfying a mining request
- e.g. find large itemsets with minsup = 5.5% from data collected in CA from 1998/11 to 1998/12
- Select the tuples whose context attributes match the request
57. Phase 1
- Generate candidates from the matched tuples
- Lemma: for each candidate itemset x, there exists at least one matched tuple in which x's support satisfies the query support
- Candidates: A, B, C, AB
58. Phase 2
- Pruning of candidate itemsets
- Calculate the upper-bound supports of the candidates
- ⇒ A is pruned
59. Phase 2
- ⇒ Send B to Phase 3
60. Phase 2
- ⇒ Keep C among the large itemsets
- ⇒ Remove AB (upper-bound support 0.053)
61. Phase 3
- Generation of large itemsets
- Rescan the data blocks of candidate B corresponding to tuples 3 and 5
- Obtain its physical (exact) support
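The pruning in Phases 2 and 3 can be sketched as follows (a hedged sketch; the tuple layout and the ceil(n·s) − 1 upper bound for itemsets not recorded in a block are assumptions based on these slides):

```python
import math

def upper_bound_count(candidate, tuples, query_minsup):
    """Phase-2 style pruning of one candidate itemset (sketch).

    tuples: matched tuples of a pattern relation, each a dict with 'n'
    (transactions in the block), 's' (minsup used when the block was
    mined) and 'counts' ({itemset: count} stored for it). If the
    candidate is absent from a block, its count there is at most
    ceil(n * s) - 1. Returns (upper_bound, blocks_to_rescan); the
    upper bound is None when the candidate can already be pruned.
    """
    total_n = sum(tp['n'] for tp in tuples)
    ub = 0
    unknown = []
    for i, tp in enumerate(tuples):
        if candidate in tp['counts']:
            ub += tp['counts'][candidate]            # exact stored count
        else:
            ub += math.ceil(tp['n'] * tp['s']) - 1   # upper bound only
            unknown.append(i)
    if ub < query_minsup * total_n:
        return None, []        # Phase 2: prune, cannot be large
    if not unknown:
        return ub, []          # exact count known: decide immediately
    return ub, unknown         # Phase 3: rescan these blocks
```

This mirrors the three outcomes on slides 58–60: a candidate is pruned outright, decided from stored counts alone, or passed to Phase 3 with the list of blocks to rescan.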
62. Multi-Dimensional Online Mining
- Helps on-line decision support
- Mines only the interesting portion of the data
- Views the data from different aspects, e.g. time, location, threshold
63. Extended MPR (EMPR)
- Additionally keeps negative-border information
- e.g. with an initial minimum support of 5%
64. Negative-Border Online Mining (NOM)
- Based on the EMPR; extends TOARM
- Appearing count: the pattern is stored in the tuple (as a frequent or negative-border pattern), so its count is known
- Not-appearing count must be bounded:
  - TOARM: at most n·s − 1
  - NOM: at most min(n·s − 1, n · min s(x′)), where x′ ranges over the proper subsets of x
65. Example
- NOM bound: min(n·s − 1, n · min s(x′)), x′ a proper subset of x
- Example: candidate AC
  - Tuple 1: n1 = 7
  - Tuple 2: n2 = min(5, 2) = 2
66. Lattice-based NOM (LNOM)
- Implements NOM more efficiently
- Uses a lattice data structure
- Uses hashing techniques
- Spanning-tree-count-calculating (STCC) algorithm
67. Experiments
- Implemented in Java
- On a workstation with dual Xeon (2.8 GHz) processors and 2 GB of main memory, running RedHat 9.0
- Datasets:
  - Several synthetic datasets (IBM data generator)
  - A real-world dataset, BMS-POS (KDD Cup 2000)
68. Notation
69. Data Sets
70. Experiment: TOARM
- Performance comparison on the synthetic datasets
71. Experiment: TOARM
- Performance comparison on the BMS-POS dataset
72. Experiment: TOARM
73. Experiment: TOARM vs. NOM
- Performance comparison on the synthetic datasets
74. Experiment: TOARM vs. NOM
- Execution time in Phase 3: NOM < TOARM
- Execution time in Phases 1 and 2: NOM > TOARM
- NOM outperforms TOARM when the set of items is small and the data size is large
75. Experiment: NOM vs. LNOM
- Performance comparison on the synthetic datasets
76. Experiment: NOM vs. LNOM
- Performance comparison on the BMS-POS dataset
77. Summary
- Multi-dimensional online mining
  - Effectively utilizes previously mined patterns
- Knowledge warehouse
  - Multidimensional pattern relation (MPR)
  - Extended multidimensional pattern relation (EMPR)
- Mining approaches
  - TOARM
  - NOM
  - LNOM
- Experimental results show good performance
78. Thank You