Title: Mining Frequent Itemsets from Uncertain Data
1Mining Frequent Itemsets from Uncertain Data
Chun-Kit Chui and Ben Kao, Department of Computer Science, The University of Hong Kong.
2Presentation Outline
- Introduction
- Existential uncertain data model
- Possible world interpretation of existential uncertain data
- The U-Apriori algorithm
- Efficient mining
  - Data trimming framework
  - Decremental approach
- Experimental results and discussions
- Conclusion
3Introduction
- Existential Uncertain Data Model
4Introduction
Traditional Transaction Dataset
Psychological Symptoms Dataset
[Table: a traditional psychological symptoms dataset in which each patient either has or does not have each of the symptoms Mood Disorder, Anxiety Disorder, Eating Disorder, Obsessive-Compulsive Disorder, Depression and Self-Destructive Disorder.]
- Psychologists may be interested in finding associations between different psychological symptoms, for example:
  Mood disorder => Eating disorder
  Eating disorder => Depression + Mood disorder
- Such associations are very useful for assisting diagnosis and guiding treatment.
- Mining frequent itemsets is an essential step in association analysis, e.g. return all itemsets that appear in s or more of the transactions in the dataset.
- In a traditional transaction dataset, whether an item exists in a transaction is well defined.
5Introduction
Existential Uncertain Dataset
Psychological Symptoms Dataset (existential probabilities)

           Mood Disorder  Anxiety Disorder  Eating Disorder  Obsessive-Compulsive Disorder  Depression  Self-Destructive Disorder
Patient 1  97%            5%                84%              14%                            76%         9%
Patient 2  90%            85%               100%             48%                            86%         65%
- In many applications, the existence of an item in a transaction is best captured by a likelihood measure or a probability.
- Symptoms, being subjective observations, are best represented by probabilities that indicate their presence.
- The likelihood of presence of each symptom is represented by an existential probability.
- What, then, is the definition of support in an uncertain dataset?
6Existential Uncertain Dataset
Existential Uncertain Dataset

               Item 1  Item 2
Transaction 1  90%     85%
Transaction 2  60%     5%
- An existential uncertain dataset is a transaction dataset in which each item is associated with an existential probability indicating the probability that the item exists in the transaction.
- Other applications of existential uncertain datasets: handwriting recognition, speech recognition, scientific datasets.
7Possible World Interpretation
- The frequency measure of an itemset in an existential uncertain dataset is defined following the possible world semantics introduced by S. Abiteboul et al. in "On the Representation and Querying of Sets of Possible Worlds", SIGMOD 1987.
8Possible World Interpretation
Psychological symptoms dataset

            Depression (S1)  Eating Disorder (S2)
Patient 1   90%              80%
Patient 2   40%              70%

- Example: a dataset with two psychological symptoms and two patients.
- There are 16 possible worlds in total.
- The support counts of itemsets are well defined in each individual world.
[Figure: the 16 possible worlds, each marking which of S1 and S2 are present for patients P1 and P2.]
From the dataset, one possibility (world 1) is that both patients actually have both psychological illnesses.
The uncertain dataset also captures, for example, the possibility (world 2) that patient 1 has only the eating disorder while patient 2 has both illnesses.
9Possible World Interpretation
Psychological symptoms dataset
- Support of itemset {Depression, Eating Disorder}

            Depression (S1)  Eating Disorder (S2)
Patient 1   90%              80%
Patient 2   40%              70%
We can discuss the support count of itemset {S1,S2} in possible world 1, and the likelihood of possible world 1 being the true world.

World   Support of {S1,S2}   World likelihood
1       2                    0.9 × 0.8 × 0.4 × 0.7 = 0.2016
2       1                    (1 − 0.9) × 0.8 × 0.4 × 0.7 = 0.0224
...     ...                  ...

[Figure: the 16 possible worlds with their support counts and likelihoods.]

In world 1 both patients have both symptoms, so the support count is 2; in world 2 patient 1 has only the eating disorder while patient 2 has both illnesses, so the support count is 1.

We define the expected support as the weighted average of the support counts represented by ALL the possible worlds.
10Possible World Interpretation
World   Support of {S1,S2}   World likelihood   Weighted support
1       2                    0.2016             2 × 0.2016 = 0.4032
2       1                    0.0224             0.0224
3       1                    0.0504             0.0504
4       1                    0.3024             0.3024
5       1                    0.0864             0.0864
6       1                    0.1296             0.1296
7       1                    0.0056             0.0056
8-16    0                    --                 0

We define the expected support as the weighted average of the support counts represented by ALL the possible worlds. To calculate it, we need to consider all possible worlds and obtain the weighted support in each enumerated possible world; the expected support is the sum of the weighted support counts of ALL the possible worlds:

Expected support = 0.4032 + 0.0224 + 0.0504 + 0.3024 + 0.0864 + 0.1296 + 0.0056 = 1

We expect 1 patient to have both Eating Disorder and Depression. (A brute-force enumeration of this example is sketched below.)
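The following is a minimal Python sketch (not from the paper) that enumerates all 16 possible worlds of the two-patient example above and accumulates the weighted supports; it reproduces the expected support of 1.

```python
# Brute-force expected support of {S1, S2} via possible-world enumeration.
from itertools import product

# Existential probabilities: P1(S1), P1(S2), P2(S1), P2(S2).
probs = [0.9, 0.8, 0.4, 0.7]

expected_support = 0.0
for world in product([True, False], repeat=4):        # 16 possible worlds
    p1_s1, p1_s2, p2_s1, p2_s2 = world
    likelihood = 1.0
    for present, p in zip(world, probs):              # independence assumption
        likelihood *= p if present else (1.0 - p)
    support = int(p1_s1 and p1_s2) + int(p2_s1 and p2_s2)
    expected_support += likelihood * support          # weighted support of this world

print(expected_support)                               # ~1.0 (up to floating-point rounding)
```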
11Possible World Interpretation
- Instead of enumerating all possible worlds to calculate the expected support, it can be calculated by scanning the uncertain dataset only once:

  $S_e(X) = \sum_{i=1}^{|D|} \prod_{x_j \in X} P_{t_i}(x_j)$

  where $P_{t_i}(x_j)$ is the existential probability of item $x_j$ in transaction $t_i$.

Psychological symptoms database

            S1    S2
Patient 1   90%   80%
Patient 2   40%   70%

The expected support of {S1,S2} can be calculated by multiplying the existential probabilities within each transaction and summing over all transactions:

Weighted support of {S1,S2} in transaction 1: 0.9 × 0.8 = 0.72
Weighted support of {S1,S2} in transaction 2: 0.4 × 0.7 = 0.28
Expected support of {S1,S2} = 0.72 + 0.28 = 1

(A one-scan computation is sketched below.)
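As a complement, here is a minimal one-scan sketch of the formula above (the helper name and data layout are assumptions for illustration, not the authors' code).

```python
# Expected support in a single scan: S_e(X) = sum_i prod_{x in X} P_{t_i}(x).
from math import prod

def expected_support(itemset, transactions):
    """transactions: list of dicts mapping item -> existential probability."""
    return sum(prod(t.get(x, 0.0) for x in itemset) for t in transactions)

db = [{"S1": 0.9, "S2": 0.8},   # patient 1
      {"S1": 0.4, "S2": 0.7}]   # patient 2
print(expected_support({"S1", "S2"}, db))   # 0.72 + 0.28 = 1.0 (up to rounding)
```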
12Mining Frequent Itemsets from Uncertain Data
- Problem Definition
- Given an existential uncertain dataset D, with each item of a transaction associated with an existential probability, and a user-specified support threshold s, return ALL the itemsets having expected support greater than or equal to s × |D|.
13Mining Frequent Itemsets from Uncertain Data
14The Apriori Algorithm
The Apriori algorithm starts by inspecting ALL size-1 items. The Subset Function scans the dataset once and obtains the support counts of ALL size-1 candidates. If item A is infrequent then, by the Apriori property, ALL supersets of A must NOT be frequent. The Apriori-Gen procedure generates ONLY those size-(k+1) candidates which are potentially frequent.
[Figure: the Apriori candidate lattice. Size-1 candidates A, B, C, D, E are counted; A is crossed out as infrequent, and Apriori-Gen generates the size-2 candidates BC, BD, BE, CD, CE, DE from the remaining large itemsets B, C, D, E.]
15The Apriori Algorithm
Subset Function
The algorithm iteratively prunes and verifies the
candidates, until no candidates are generated.
16The Apriori Algorithm
Subset Function
- The Subset-Function scans the dataset
transaction by transaction to update the support
counts of the candidates.
Transaction 1: 1 (90%), 2 (80%), 4 (5%), 5 (60%), 8 (0.2%), ..., 991 (95%)

Recall that in the uncertain data model, each item is associated with an existential probability.

Candidate itemset   Expected support count
{1,2}               0
{1,5}               0
{1,8}               0
{4,5}               0
{4,8}               0
17The Apriori Algorithm
Subset Function

Transaction 1: 1 (90%), 2 (80%), 4 (5%), 5 (60%), 8 (0.2%), ..., 991 (95%)

For each candidate contained in the transaction, the expected support count is incremented by the product of the existential probabilities of its items:

Candidate itemset   Expected support count after transaction 1
{1,2}               0 + 0.9 × 0.8 = 0.72
{1,5}               0 + 0.9 × 0.6 = 0.54
{1,8}               0 + 0.9 × 0.002 = 0.0018
{4,5}               0 + 0.05 × 0.6 = 0.03
{4,8}               0 + 0.05 × 0.002 = 0.0001

We call this slightly modified algorithm the U-Apriori algorithm, which serves as the brute-force approach to mining uncertain datasets. (A sketch of this counting step follows.)
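A minimal sketch of this counting step (assumed data layout and candidate list; a simplification of the hash-tree based Subset Function, not the authors' implementation):

```python
# U-Apriori counting: each candidate's expected support count is incremented by
# the product of the existential probabilities of its items in every transaction,
# even when the increment is insignificant.
from math import prod

def u_apriori_count(candidates, transactions):
    exp_support = {c: 0.0 for c in candidates}
    for t in transactions:                        # one pass per iteration
        for c in candidates:
            if all(x in t for x in c):
                exp_support[c] += prod(t[x] for x in c)
    return exp_support

t1 = {1: 0.90, 2: 0.80, 4: 0.05, 5: 0.60, 8: 0.002, 991: 0.95}
cands = [frozenset(p) for p in [(1, 2), (1, 5), (1, 8), (4, 5), (4, 8)]]
counts = u_apriori_count(cands, [t1])
print(counts[frozenset({1, 2})], counts[frozenset({4, 8})])   # approx. 0.72 and 0.0001
```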
18The Apriori Algorithm
Subset Function

Transaction 1: 1 (90%), 2 (80%), 4 (5%), 5 (60%), 8 (0.2%), ..., 991 (95%)

Candidate itemset   Expected support count after transaction 1
{1,2}               0.72
{1,5}               0.54
{1,8}               0.0018
{4,5}               0.03
{4,8}               0.0001

Even for an insignificant support increment such as 0.05 × 0.002 = 0.0001, U-Apriori still has to traverse the hash tree, find the corresponding entry {4,8}, and increment its expected support count.
19Previous work: The Data Trimming Framework
- Avoid incrementing those insignificant expected
support counts.
20Data Trimming Framework
Trimming phase
- The uncertain dataset is first trimmed to remove the items with low existential probability (trimming threshold: 10%).

Original dataset:
      I1    I2
t1    90%   80%
t2    80%   4%
t3    2%    5%
t4    5%    95%
t5    94%   95%

Trimmed dataset:
      I1    I2
t1    90%   80%
t2    80%
t4          95%
t5    94%   95%

Statistics gathered by the Trimming Module:
Item  Total expected support count trimmed  Maximum existential probability trimmed
I1    1.1                                   5%
I2    1.2                                   3%

(A sketch of the trimming step follows.)
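A minimal sketch of the trimming step (assumed data layout; the statistics shown on this slide are illustrative, and the exact statistics kept may differ):

```python
# Trimming module: remove items whose existential probability is below the
# trimming threshold and keep per-item statistics for the pruning/patch-up phases.
def trim(transactions, threshold=0.10):
    trimmed, stats = [], {}   # stats: item -> [total exp. support trimmed, max prob trimmed]
    for t in transactions:
        kept = {}
        for item, p in t.items():
            if p >= threshold:
                kept[item] = p
            else:
                s = stats.setdefault(item, [0.0, 0.0])
                s[0] += p                 # total expected support count trimmed
                s[1] = max(s[1], p)       # maximum existential probability trimmed
        if kept:
            trimmed.append(kept)
    return trimmed, stats

db = [{"I1": 0.90, "I2": 0.80}, {"I1": 0.80, "I2": 0.04}, {"I1": 0.02, "I2": 0.05},
      {"I1": 0.05, "I2": 0.95}, {"I1": 0.94, "I2": 0.95}]
trimmed, stats = trim(db)
print(len(trimmed), stats)    # 4 transactions survive; t3 is dropped entirely
```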
21Data Trimming Framework
Trimming phase and mining phase
[Diagram: the Trimming Module produces the Trimmed Dataset and Statistics; Uncertain Apriori mines the trimmed dataset; in the k-th iteration the Pruning Module exchanges infrequent and potentially frequent k-itemsets with Uncertain Apriori.]
The uncertain database is first trimmed to remove the items with low existential probability, and the trimmed dataset is then mined.
22Data Trimming Framework
Trimming phase, mining phase and patch-up phase
[Diagram: as above, with a Patch Up Module that combines the frequent itemsets found in the trimmed dataset with the potentially frequent itemsets to produce the frequent itemsets in the original dataset.]
The uncertain database is first trimmed to remove the items with low existential probability, and the trimmed dataset is mined. The patch-up phase recovers the true frequent itemsets that are missed when mining the trimmed dataset.
23Data Trimming Framework
Trimming phase, mining phase and patch-up phase
[Diagram: the same framework annotated with the overheads (the trimming and patch-up phases) and the saving (mining the smaller trimmed dataset).]
The performance of the trimming framework is sensitive to the trimming threshold, which is hard to determine. The saving can be counterproductive if there are few low-probability items. Moreover, the framework has to scan the original dataset at least once, and scanning the original dataset is an expensive operation.
24Data Trimming Framework
- The main drawback of Data Trimming is that its performance is sensitive to the percentage of low-probability items.
- When there are few low-probability items in the dataset, data trimming can be counterproductive.
25Decremental Pruning
- Relatively insensitive to the percentage of low
probability items in the dataset
26Decremental Pruning
Subset Function
Tid / Item a b c d
t1 1 0.5 0.3 0.2
t2 0.9 0.8 0.7 0.4
t3 0.3 0 0.9 0.7
t4 0.4 0.8 0.3 0.7
Large itemsets
Candidates
Apriori-Gen
- The decremental pruning technique achieves candidate reduction during the mining process.
- It progressively estimates upper bounds of the expected supports of the candidate itemsets as each transaction of the dataset is processed.
27Decremental Pruning
The beginning of the 2nd iteration
Subset Function
Scan
Tid / Item a b c d
t1 1 0.5 0.3 0.2
t2 0.9 0.8 0.7 0.4
t3 0.3 0 0.9 0.7
t4 0.4 0.8 0.3 0.7
Large itemsets
Candidates
Apriori-Gen
Minimum expected support count: 2
Candidate: {a,b}
The U-Apriori algorithm has to scan the entire dataset to obtain the expected support count of candidate itemset {a,b} before it can be identified as infrequent.
28Decremental Pruning
The beginning of the 2nd iteration
Subset Function
Scan
Tid / Item a b c d
t1 1 0.5 0.3 0.2
t2 0.9 0.8 0.7 0.4
t3 0.3 0 0.9 0.7
t4 0.4 0.8 0.3 0.7
Large itemsets
Candidates
Apriori-Gen
Minimum expected support count: 2
In the decremental pruning technique, we progressively estimate an upper bound of Se({a,b}). Before any transaction is scanned, the upper bound equals the expected support count of the singleton {a}: 1 + 0.9 + 0.3 + 0.4 = 2.6.

Candidate {a,b}: upper bound of Se({a,b})
Before any transaction is processed: 2.6

In this estimate, we are assuming that the existential probability of singleton b is 100% in all transactions.
29Decremental Pruning
The beginning of the 2nd iteration
Subset Function
Tid / Item a b c d
t1 1 0.5 0.3 0.2
t2 0.9 0.8 0.7 0.4
t3 0.3 0 0.9 0.7
t4 0.4 0.8 0.3 0.7
Large itemsets
Candidates
Apriori-Gen
Minimum expected support count: 2
After scanning t1, we know that the existential probability of b in t1 is 0.5 rather than the assumed 1. Therefore the upper bound of Se({a,b}) has been overestimated by 1 × (1 − 0.5) = 0.5.

Candidate {a,b}: upper bound of Se({a,b})
Before any transaction is processed: 2.6
After t1 is processed: 2.6 − (1 × (1 − 0.5)) = 2.1
30Decremental Pruning
The beginning of the 2nd iteration
Subset Function
Tid / Item a b c d
t1 1 0.5 0.3 0.2
t2 0.9 0.8 0.7 0.4
t3 0.3 0 0.9 0.7
t4 0.4 0.8 0.3 0.7
Large itemsets
Candidates
Apriori-Gen
Minimum expected support count: 2
Since the existential probability of singleton b in transaction t2 is 0.8, we have overestimated it by 1 − 0.8 = 0.2, so the upper bound of Se({a,b}) was overestimated by 0.9 × 0.2 = 0.18.

Candidate {a,b}: upper bound of Se({a,b})
Before any transaction is processed: 2.6
After t1 is processed: 2.6 − (1 × (1 − 0.5)) = 2.1
After t2 is processed: 2.1 − (0.9 × (1 − 0.8)) = 1.92
31Decremental Pruning
The beginning of the 2nd iteration
Subset Function
Tid / Item a b c d
t1 1 0.5 0.3 0.2
t2 0.9 0.8 0.7 0.4
t3 0.3 0 0.9 0.7
t4 0.4 0.8 0.3 0.7
Large itemsets
Candidates
Apriori-Gen
Minimum expected support count: 2
Now, since the upper bound of Se({a,b}) = 1.92 is less than the minimum expected support count, {a,b} is identified as infrequent and can be pruned without scanning the remaining transactions.

Candidate {a,b}: upper bound of Se({a,b})
Before any transaction is processed: 2.6
After t1 is processed: 2.6 − (1 × (1 − 0.5)) = 2.1
After t2 is processed: 2.1 − (0.9 × (1 − 0.8)) = 1.92

(A sketch of this progressive upper-bound update is given below.)
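A minimal sketch of the progressive upper-bound update worked through on the last few slides (assumed structure, not the authors' code):

```python
# Upper bound of Se({a,b}): start from Se({a}) (as if b were 100% everywhere)
# and subtract the overestimation P(a) * (1 - P(b)) revealed by each transaction.
transactions = [
    {"a": 1.0, "b": 0.5, "c": 0.3, "d": 0.2},   # t1
    {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.4},   # t2
    {"a": 0.3, "b": 0.0, "c": 0.9, "d": 0.7},   # t3
    {"a": 0.4, "b": 0.8, "c": 0.3, "d": 0.7},   # t4
]
min_expected_support = 2.0

upper_bound = sum(t["a"] for t in transactions)        # Se({a}) = 2.6
for i, t in enumerate(transactions, start=1):
    upper_bound -= t["a"] * (1.0 - t["b"])             # remove the overestimation
    print(f"after t{i}: upper bound = {upper_bound:.2f}")
    if upper_bound < min_expected_support:
        print("{a,b} cannot be frequent -- pruned early")
        break
```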
32Decremental Pruning
The beginning of the 2nd iteration
Tid / Item a b c d
t1 1 0.5 0.3 0.2
t2 0.9 0.8 0.7 0.4
t3 0.3 0 0.9 0.7
t4 0.4 0.8 0.3 0.7
We call this quantity the decremental counter of {a,b} with respect to {a}: the upper bound of Se({a,b}) obtained by assuming that item b has probability 100% from transaction tk to the last transaction. Similarly, we can define a counter that assumes item a has probability 100% in the remaining transactions; it is initialized to the expected support count of {b}.

Minimum expected support count: 2
Candidate {a,b}: decremental counters

                                 Assuming b is 100%               Assuming a is 100%
Before any trans. is processed   2.6                              2.1
After t1 is processed            2.6 − (1 × (1 − 0.5)) = 2.1      2.1 − (0.5 × (1 − 1)) = 2.1
After t2 is processed            2.1 − (0.9 × (1 − 0.8)) = 1.92   2.1 − (0.8 × (1 − 0.9)) = 2.02
33Decremental Pruning
Decremental counter initialization and update
process.
Proof.
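The formulas and the proof on this slide did not survive extraction. The following is a hedged reconstruction from the worked example on the previous slides, writing d_k(X', X) for the decremental counter of candidate X with retained part X' after k transactions (the notation is assumed here, not necessarily the paper's):

```latex
% Initialization and update of a decremental counter (reconstruction).
% X' is a proper subset of the candidate X; the items in X \setminus X' are
% assumed to have probability 1 in the not-yet-scanned transactions.
d_0(X', X) = S_e(X'), \qquad
d_k(X', X) = d_{k-1}(X', X)
  - \Big(\prod_{x \in X'} P_{t_k}(x)\Big)
    \Big(1 - \prod_{y \in X \setminus X'} P_{t_k}(y)\Big)
% Claimed property (the "Proof" referred to above): d_k(X', X) >= S_e(X) for
% every k, with equality once all transactions have been processed.
```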
34Decremental Pruning
- The brute-force decremental counter method is infeasible because:
  - there are too many decremental counters to maintain;
  - it is ineffective to prune the candidates one by one.
[Figure: a hash tree (levels 0 and 1) organizing size-2 candidates such as {a,b}, {a,g}, {a,k}, {b,d}, {b,e}, {b,f}, {b,g}, {c,d}, {c,e}, {c,f}, each with its expected support count and its decremental counters.]
35Aggregate by singleton method (AS)
- Aggregate by singleton method (AS)
- Reduces the number of decremental counters to the number of frequent singletons.
[Figure: the per-candidate decremental counters of three size-2 candidates (6 decremental counters in total in the brute-force method) aggregated into singleton decremental counters.]
36Aggregate by singleton method (AS)
- Aggregate by singleton method (AS)
- Reduces the number of decremental counters to the
number of frequent singletons. - If the counter ds(a,k) drops below the minimum
support requirement, candidates that contain a
must be infrequent and can be pruned.
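A minimal sketch of the AS idea (the aggregation rule below is an assumption chosen so that each singleton counter stays a valid upper bound; the paper's exact formula may differ):

```python
# Aggregate-by-singleton (AS): one decremental counter ds(x) per singleton,
# initialised to Se({x}).  Per transaction it is decremented by
# P(x) * (1 - max partner probability), which keeps ds(x) >= Se(C) for every
# surviving candidate C that contains x.
def as_pruning(candidates, transactions, min_support):
    items = sorted(set().union(*candidates))
    ds = {x: sum(t.get(x, 0.0) for t in transactions) for x in items}   # Se({x})
    alive = set(candidates)                      # size-2 candidates as frozensets
    for t in transactions:
        for x in items:
            partners = [next(iter(c - {x})) for c in alive if x in c]
            if not partners:
                continue
            ds[x] -= t.get(x, 0.0) * (1.0 - max(t.get(y, 0.0) for y in partners))
            if ds[x] < min_support:
                alive = {c for c in alive if x not in c}   # prune all candidates with x
    return alive

db = [{"a": 1.0, "b": 0.5}, {"a": 0.9, "b": 0.8},
      {"a": 0.3, "b": 0.0}, {"a": 0.4, "b": 0.8}]
print(as_pruning({frozenset({"a", "b"})}, db, min_support=2.0))   # set(): {a,b} pruned
```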
37Common prefix method (CP)
- Another method is to aggregate the decremental counters that share a common prefix.
- We keep only those decremental counters in which the retained part X' is a proper prefix of the candidate X.
- We use a single prefix decremental counter as an upper bound of ALL the decremental counters of the candidates sharing that prefix.
[Figure: candidates {a,b} and {a,c} share the prefix decremental counter of prefix a; candidates {b,c} and {b,d} share the prefix decremental counter of prefix b.]
38Common prefix method (CP)
- Many frequent itemset mining algorithms use a prefix-tree-like data structure to organize the candidates, which makes this aggregation natural.
[Figure: the candidates {a,b}, {a,c}, {b,c}, {b,d}, {c,e} grouped by common prefix, each group sharing one prefix decremental counter.]
39Common prefix method (CP)
- Candidates with common prefix are organized under
the same subtree in a prefix tree (hash tree). - If the value of a prefix decremental counter
drops below the minimum support requirement, all
the candidates in the corresponding subtree can
be pruned.
[Figure: when the prefix decremental counter of a subtree (e.g. the subtree of prefix a containing {a,b} and {a,c}) drops below the minimum support requirement, the whole subtree is pruned.]
(A sketch of the CP method is given below.)
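A minimal sketch of the CP idea for size-2 candidates (the aggregation rule is an assumption analogous to the AS sketch above; a real implementation would hang these counters on the hash-tree subtrees):

```python
# Common-prefix (CP): candidates are grouped by prefix item, and one prefix
# decremental counter per group upper-bounds the expected support of every
# candidate in that subtree, so a whole subtree can be pruned at once.
def cp_pruning(candidates, transactions, min_support):
    subtrees = {}                                    # prefix item -> set of suffix items
    for prefix, suffix in candidates:                # candidates as (prefix, suffix) pairs
        subtrees.setdefault(prefix, set()).add(suffix)
    dp = {p: sum(t.get(p, 0.0) for t in transactions) for p in subtrees}   # Se({p})
    for t in transactions:
        for p, suffixes in list(subtrees.items()):
            dp[p] -= t.get(p, 0.0) * (1.0 - max(t.get(y, 0.0) for y in suffixes))
            if dp[p] < min_support:
                del subtrees[p]                      # prune the whole subtree rooted at p
    return {(p, s) for p, suffixes in subtrees.items() for s in suffixes}

db = [{"a": 1.0, "b": 0.5, "c": 0.3}, {"a": 0.9, "b": 0.8, "c": 0.7},
      {"a": 0.3, "b": 0.0, "c": 0.9}, {"a": 0.4, "b": 0.8, "c": 0.3}]
print(cp_pruning({("a", "b"), ("a", "c")}, db, min_support=2.0))   # set(): subtree of a pruned
```

Pruning a whole subtree at once is what gives CP more pruning power than removing candidates one by one.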
40Common prefix method (CP)
- Improving the pruning power of the CP method.
- An item that comes earlier in the item order is the prefix of more candidates.
- Example with candidates {a,b}, {a,c}, {a,d}, {b,c}, {b,d}, {c,d}: item a is the prefix of 3 candidates, while item c is the prefix of only 1 candidate. If the prefix decremental counter of a drops below the minimum support count, we can prune 3 itemsets; if that of c drops below it, we can prune only 1 itemset.
41Common prefix method (CP)
- Our strategy is to reorder the items by their expected supports in ascending order.
- Items with small expected supports are then placed first, so they become the prefixes of more candidates, and their prefix decremental counters are more likely to drop below the minimum support threshold, allowing larger groups of candidates to be pruned.
42Details of the aggregation methods
Initialization and update of singleton
decremental counter
Initialization and update of prefix decremental
counter
43Experiments and Discussions
44Synthetic datasets
Step 1: Generate data without uncertainty using the IBM Synthetic Datasets Generator.
- Average length of each transaction: T = 25
- Average length of frequent patterns: I = 6
- Number of transactions: D = 100K

TID  Items
1    2, 4, 9
2    5, 4, 10
3    1, 6, 7

Step 2: Introduce existential uncertainty to each item in the generated dataset (Data Uncertainty Simulator).
- Assign relatively high probabilities to the items in the generated dataset: Normal distribution (mean = 75%, standard deviation = 15%).
- Assign additional items with relatively low probabilities to each transaction: Normal distribution (mean = 25%, standard deviation = 15%).
- The proportion of items with low probabilities is controlled by the parameter R (R = 75%).

TID  Items
1    2 (90%), 4 (80%), 9 (30%), 10 (4%), 19 (25%)
2    5 (75%), 4 (68%), 10 (100%), 14 (15%), 19 (23%)
3    1 (88%), 6 (95%), 7 (98%), 13 (2%), 18 (7%), 22 (10%), 25 (6%)

(A sketch of the uncertainty simulator is given below.)
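A minimal sketch of the data uncertainty simulator described above (the function name, clipping and the way R is enforced are assumptions for illustration, not the authors' generator):

```python
# Step 2: attach existential probabilities to the items of a transaction.
import random

def add_uncertainty(transaction, all_items, r=0.75, rng=random):
    clip = lambda p: min(1.0, max(0.0, p))
    # Items generated in step 1 get relatively high probabilities ~ N(75%, 15%).
    uncertain = {i: clip(rng.gauss(0.75, 0.15)) for i in transaction}
    # Add extra items with low probabilities ~ N(25%, 15%) so that a fraction R
    # of the items in the transaction are low-probability items.
    n_low = int(len(transaction) * r / (1.0 - r))
    for i in rng.sample([x for x in all_items if x not in transaction], n_low):
        uncertain[i] = clip(rng.gauss(0.25, 0.15))
    return uncertain

print(add_uncertainty({2, 4, 9}, list(range(1, 1001)), r=0.75))
```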
45Performance of Decremental Pruning
Both AS and CP reduce the candidate set in a progressive way. CP reduces the candidate set by about 20% after 60% of the transactions are processed. The pruning power of CP is higher than that of AS.
46Performance of Decremental Pruning
The 2nd iteration is the computational
bottleneck. Although CP pruned twice as much as
AS, the CPU cost saving is not doubled because CP
requires a more complex recursive strategy to
maintain the prefix decremental counters.
47Compare with Data Trimming
The performance of Data Trimming is very sensitive to R, while the performance of the decremental approaches is relatively stable. The combined method, which uses both CP and Data Trimming, strikes a good balance and gives consistently good performance.
48Vary minimum support threshold
CP performs slightly better than AS over a wide range of support thresholds. A larger support threshold implies a larger minimum expected support count, so it is easier for the decremental counters to drop below the required value, and more candidates can be pruned early.
49Conclusion
- We study the problem of mining frequent itemsets under a probabilistic framework.
- Data Trimming technique:
  - works well when there is a substantial amount of items with low existential probabilities in the dataset;
  - is counterproductive when there are few low-probability items in the dataset.
- We proposed the decremental pruning technique, which achieves candidate reduction during the mining process.
- Experimental results show that the Decremental Pruning and Data Trimming techniques can be combined to yield the most stable and efficient algorithm.
50Thank you!
51Appendix
52Data Trimming Framework
- Direction: try to avoid incrementing those insignificant expected support counts.
- This saves the effort of:
  - traversing the hash tree;
  - computing the expected support count (multiplication of floating-point variables);
  - the I/O for retrieving the items with very low existential probability.
53Computational Issue
- Preliminary experiment to verify the computational bottleneck of mining uncertain datasets.
- 7 synthetic datasets with the same frequent itemsets.
- Vary the percentage of items with low existential probability (R) in the datasets:

Dataset  1    2       3    4    5       6      7
R        0%   33.33%  50%  60%  66.67%  71.4%  75%
54Computational Issue
[Chart: CPU cost in each iteration for the different datasets; the x-axis is the iteration number.]
Although all datasets contain the same frequent itemsets, U-Apriori requires different amounts of time to execute on them. The dataset with 75% low-probability items (dataset 7) triggers many insignificant support increments, which may be redundant; the gap between it and dataset 1 (0% low-probability items) can potentially be reduced.
55Data Trimming Framework
Uncertain dataset:
      I1    I2
t1    90%   80%
t2    80%   4%
t3    2%    5%
t4    5%    95%
t5    94%   95%

Trimmed dataset:
      I1    I2
t1    90%   80%
t2    80%
t4          95%
t5    94%   95%

Statistics:
Item  Total expected support count trimmed  Maximum existential probability trimmed
I1    1.1                                   5%
I2    1.2                                   3%

- Create a trimmed dataset by trimming out all items with low existential probabilities.
- During the trimming process, some statistics are kept for error estimation when mining the trimmed dataset:
  - the total expected support count trimmed for each item;
  - the maximum existential probability trimmed for each item;
  - other information, e.g. inverted lists, signature files, etc.
56Data Trimming Framework
The uncertain database is first passed into the
trimming module to remove the items with low
existential probability and gather statistics
during the trimming process.
Trimming Module
57Data Trimming Framework
[Diagram: Trimming Module → Trimmed Dataset → Uncertain Apriori.]
The trimmed dataset is then mined by the Uncertain Apriori algorithm.
58Data Trimming Framework
[Diagram: as above, now showing the infrequent k-itemsets output by Uncertain Apriori.]
Notice that the infrequent itemsets pruned by the Uncertain Apriori algorithm are only infrequent in the trimmed dataset.
59Data Trimming Framework
[Diagram: as above, with the Pruning Module and the Statistics gathered during trimming added.]
The pruning module uses the statistics gathered by the trimming module to identify the itemsets which are infrequent in the original dataset.
60Data Trimming Framework
[Diagram: as above; in the k-th iteration the Pruning Module passes the potentially frequent k-itemsets back to Uncertain Apriori.]
The potentially frequent itemsets are passed back to the Uncertain Apriori algorithm to generate candidates for the next iteration.
61Data Trimming Framework
[Diagram: the complete framework, with the Patch Up Module combining the frequent itemsets in the trimmed dataset and the potentially frequent itemsets into the frequent itemsets in the original dataset.]
The potentially frequent itemsets are verified by the patch up module against the original dataset.
62Data Trimming Framework
[Diagram: the complete data trimming framework.]
63Data Trimming Framework
There are three modules under the data trimming framework, and each module can have different strategies.
- Trimming Module: is the trimming threshold global to all items or local to each item?
- Pruning Module: what statistics are used in the pruning strategy?
- Patch Up Module: can we use a single scan to verify all the potentially frequent itemsets, or are multiple scans over the original dataset needed?
[Diagram: the data trimming framework.]
64Data Trimming Framework
[Diagram: the data trimming framework.]
- To what extent do we trim the dataset?
  - If we trim too little, the computational cost saved cannot compensate for the overhead.
  - If we trim too much, mining the trimmed dataset will miss many frequent itemsets, pushing the workload to the patch up module.
65Data Trimming Framework
- The role of the pruning module is to estimate the error of mining the trimmed dataset.
- Bounding techniques should be applied here to estimate the upper bound and/or lower bound of the true expected support of each candidate.
[Diagram: the data trimming framework.]
66Data Trimming Framework
[Diagram: the data trimming framework.]
- We try to adopt a single-scan patch up strategy so as to save the I/O cost of scanning the original dataset.
- To achieve this, the potentially frequent itemsets output by the pruning module should contain all the true frequent itemsets missed in the mining process.
(A sketch of a single-scan patch up step is given below.)
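A minimal sketch of a single-scan patch up step under these requirements (assumed structure, not the authors' code):

```python
# Patch up: verify ALL potentially frequent itemsets with one pass over the
# original (untrimmed) dataset, then keep those meeting the support threshold.
from math import prod

def patch_up(potentially_frequent, original_transactions, min_support):
    exact = {c: 0.0 for c in potentially_frequent}
    for t in original_transactions:              # the single scan
        for c in exact:
            if all(x in t for x in c):
                exact[c] += prod(t[x] for x in c)
    return {c for c, s in exact.items() if s >= min_support}
```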