Mining Frequent Itemsets from Uncertain Data - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Mining Frequent Itemsets from Uncertain Data


1
Mining Frequent Itemsets from Uncertain Data
Chun-Kit Chui and Ben Kao, Department of Computer
Science, The University of Hong Kong
  • Presenter: Chun-Kit Chui

2
Presentation Outline
  • Introduction
  • Existential uncertain data model
  • Possible world interpretation of existential
    uncertain data
  • The U-Apriori algorithm
  • Efficient mining
  • Data trimming framework
  • Decremental approach
  • Experimental results and discussions
  • Conclusion

3
Introduction
  • Existential Uncertain Data Model

4
Introduction
Traditional Transaction Dataset
Psychological Symptoms Dataset
[Table: a traditional (binary) transaction dataset over six
psychological symptoms (Mood Disorder, Anxiety Disorder,
Eating Disorder, Obsessive-Compulsive Disorder, Depression,
Self-Destructive Disorder), with a tick marking each symptom
that Patient 1 and Patient 2 exhibit.]
  • Psychologists may be interested in finding the
    following associations between different
    psychological symptoms.

These associations are very useful information to
assist diagnosis and treatment.
Mood disorder => Eating disorder
Eating disorder => Depression + Mood disorder
  • Mining frequent itemsets is an essential step in
    association analysis.
  • E.g. return all itemsets that appear in a fraction s
    or more of the transactions in the dataset.

In a traditional transaction dataset, whether an
item exists in a transaction is well defined.
5
Introduction
Existential Uncertain Dataset
Psychological Symptoms Dataset
Patient 1: Mood Disorder 97%, Anxiety Disorder 5%, Eating Disorder 84%, Obsessive-Compulsive Disorder 14%, Depression 76%, Self-Destructive Disorder 9%
Patient 2: Mood Disorder 90%, Anxiety Disorder 85%, Eating Disorder 100%, Obsessive-Compulsive Disorder 48%, Depression 86%, Self-Destructive Disorder 65%
  • In many applications, the existence of an item in
    a transaction is best captured by a likelihood
    measure or a probability.
  • Symptoms, being subjective observations, would
    best be represented by probabilities that
    indicate their presence.
  • The likelihood of presence of each symptom is
    represented in terms of existential
    probabilities.
  • What is the definition of support in an uncertain
    dataset?

6
Existential Uncertain Dataset
Existential Uncertain Dataset
                Item 1   Item 2
Transaction 1   90%      85%
Transaction 2   60%      5%
  • An existential uncertain dataset is a transaction
    dataset in which each item is associated with an
    existential probability indicating the
    probability that the item exists in the
    transaction.
  • Other applications of existential uncertain
    datasets
  • Handwriting recognition, Speech recognition
  • Scientific Datasets

7
Possible World Interpretation
  • The frequency measure of an itemset in an
    existential uncertain dataset is defined under the
    possible world interpretation, introduced by
    S. Abiteboul et al. in the paper On the
    Representation and Querying of Sets of Possible
    Worlds, SIGMOD 1987.
8
Possible World Interpretation
Psychological symptoms dataset
  • Example
  • A dataset with two psychological symptoms and two
    patients.
  • 16 Possible Worlds in total.
  • The support counts of itemsets are well defined
    in each individual world.

            Depression   Eating Disorder
Patient 1   90%          80%
Patient 2   40%          70%
[Figure: the 16 possible worlds, each marking which of
S1 (Depression) and S2 (Eating Disorder) are present for
Patient 1 (P1) and Patient 2 (P2).]
From the dataset, one possibility is that both
patients actually have both psychological illnesses.
On the other hand, the uncertain dataset also
captures the possibility that Patient 1 has only the
eating disorder while Patient 2 has both illnesses.
9
Possible World Interpretation
Psychological symptoms dataset
  • Support of itemset {Depression, Eating Disorder}

            Depression   Eating Disorder
Patient 1   90%          80%
Patient 2   40%          70%
World   Support of {S1, S2}   World Likelihood
1       2                     0.9 × 0.8 × 0.4 × 0.7 = 0.2016
2       1                     0.0224
...

We can discuss the support count of itemset {S1, S2} in
possible world 1, and likewise the likelihood of possible
world 1 being the true world.
[Figure: the 16 possible worlds again, each annotated with
its support count for {S1, S2} and its likelihood.]
We define the expected support as the weighted average of
the support counts represented by ALL the possible worlds.
10
Possible World Interpretation
World   Support of {S1, S2}   World Likelihood   Weighted Support
1       2                     0.2016             0.4032
2       1                     0.0224             0.0224
3       1                     0.0504             0.0504
4       1                     0.3024             0.3024
5       1                     0.0864             0.0864
6       1                     0.1296             0.1296
7       1                     0.0056             0.0056
8       0                     ...                0
...
Expected Support = 1

We define the expected support as the weighted average of the
support counts represented by ALL the possible worlds. To
calculate it this way, we need to consider all possible worlds
and obtain the weighted support in each enumerated possible
world; the expected support is the sum of the weighted support
counts of ALL the possible worlds. We expect that 1 patient
will have both Eating Disorder and Depression.
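To make the possible-world computation concrete, the following is a
minimal Python sketch (illustrative, not the authors' code) that
enumerates every possible world of the two-patient example above and
sums likelihood × support; the dictionary-based dataset layout is an
assumption made for illustration.

```python
from itertools import product

# Existential probabilities of the example: dataset[patient][symptom]
dataset = [
    {"S1": 0.9, "S2": 0.8},   # Patient 1 (S1 = Depression, S2 = Eating Disorder)
    {"S1": 0.4, "S2": 0.7},   # Patient 2
]
itemset = {"S1", "S2"}

def expected_support_by_worlds(dataset, itemset):
    """Enumerate every possible world; expected support = sum of likelihood * support."""
    # Every (transaction, item) pair may or may not exist in a given world.
    cells = [(t, item) for t, probs in enumerate(dataset) for item in probs]
    expected = 0.0
    for outcome in product([True, False], repeat=len(cells)):
        likelihood = 1.0
        world = [set() for _ in dataset]
        for (t, item), present in zip(cells, outcome):
            p = dataset[t][item]
            likelihood *= p if present else (1.0 - p)
            if present:
                world[t].add(item)
        support = sum(1 for items in world if itemset <= items)
        expected += likelihood * support
    return expected

print(expected_support_by_worlds(dataset, itemset))   # 1.0 for this example
```

With 2 transactions and 2 items this enumerates the 16 worlds and
returns 1.0, matching the slide, but the number of worlds grows
exponentially, which motivates the single-scan formula on the next slide.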
11
Possible World Interpretation
  • Instead of enumerating all possible worlds to
    calculate the expected support, it can be calculated
    by scanning the uncertain dataset only once:

    Se(X) = Σ_{ti ∈ D} Π_{xj ∈ X} Pti(xj)

where Pti(xj) is the existential probability of
item xj in transaction ti.
Psychological symptoms dataset:
            S1 (Depression)   S2 (Eating Disorder)
Patient 1   90%               80%
Patient 2   40%               70%
The expected support of {S1, S2} can be calculated by
multiplying the existential probabilities within each
transaction and summing over all transactions:
Weighted support of {S1, S2} in Patient 1's transaction: 0.9 × 0.8 = 0.72
Weighted support of {S1, S2} in Patient 2's transaction: 0.4 × 0.7 = 0.28
Expected Support of {S1, S2} = 0.72 + 0.28 = 1
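The same value can be computed with the single-scan formula above; a
minimal Python sketch (illustrative, not the authors' code):

```python
def expected_support(dataset, itemset):
    """Se(X) = sum over transactions ti of the product of Pti(xj) for xj in X.
    A transaction contributes 0 if any item of X is absent from it."""
    total = 0.0
    for probs in dataset:              # probs maps item -> existential probability
        contribution = 1.0
        for item in itemset:
            contribution *= probs.get(item, 0.0)
        total += contribution
    return total

# Same two-patient example: 0.9 * 0.8 + 0.4 * 0.7 = 1.0
print(expected_support([{"S1": 0.9, "S2": 0.8}, {"S1": 0.4, "S2": 0.7}], {"S1", "S2"}))
```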
12
Mining Frequent Itemsets from Uncertain Data
  • Problem Definition
  • Given an existential uncertain dataset D, with each
    item of a transaction associated with an existential
    probability, and a user-specified support threshold s,
    return ALL the itemsets having expected support greater
    than or equal to |D| × s.

13
Mining Frequent Itemsets from Uncertain Data
  • The U-Apriori algorithm

14
The Apriori Algorithm
The Subset Function scans the dataset once and
obtains the support counts of ALL size-1 candidates.
Item A is infrequent; by the Apriori Property, ALL
supersets of A must NOT be frequent.
Subset Function
The Apriori-Gen procedure generates ONLY those
size-(k+1) candidates which are potentially
frequent.
The Apriori algorithm starts by inspecting ALL
size-1 items.
[Figure: one Apriori iteration. The size-1 candidates A, B,
C, D, E are counted; A is infrequent and is crossed out. From
the large itemsets B, C, D, E, Apriori-Gen generates the
size-2 candidates BC, BD, BE, CD, CE, DE.]
15
The Apriori Algorithm
Subset Function
The algorithm iteratively prunes and verifies the
candidates, until no candidates are generated.
[Figure: later iterations of the Subset Function and
Apriori-Gen, with infrequent candidates crossed out at each
level until no more candidates are generated.]
16
The Apriori Algorithm
Subset Function
  • The Subset-Function scans the dataset
    transaction by transaction to update the support
    counts of the candidates.

Transaction 1: item 1 (90%), item 2 (80%), item 4 (5%),
item 5 (60%), item 8 (0.2%), ..., item 991 (95%)
Recall that in the uncertain data model, each
item is associated with an existential
probability.
Candidate Itemset   Expected Support Count
{1, 2}              0
{1, 5}              0
{1, 8}              0
{4, 5}              0
{4, 8}              0
17
The Apriori Algorithm
Subset Function
Transaction 1: item 1 (90%), item 2 (80%), item 4 (5%),
item 5 (60%), item 8 (0.2%), ..., item 991 (95%)
Candidate Itemset   Expected Support Count after Transaction 1
{1, 2}              0 + 0.9 × 0.8    = 0.72
{1, 5}              0 + 0.9 × 0.6    = 0.54
{1, 8}              0 + 0.9 × 0.002  = 0.0018
{4, 5}              0 + 0.05 × 0.6   = 0.03
{4, 8}              0 + 0.05 × 0.002 = 0.0001
We call this slightly modified algorithm the
U-Apriori algorithm, which serves as the
brute-force approach to mining uncertain
datasets.
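The change U-Apriori makes to the Subset Function can be sketched as
follows; this is an illustrative Python fragment that replaces the
candidate hash tree with a flat dictionary and assumes transactions are
stored as item-to-probability maps.

```python
def update_candidates(candidate_counts, transaction):
    """Add one transaction's contribution to every candidate's expected support.
    candidate_counts: {frozenset of items: running expected support}
    transaction:      {item: existential probability}"""
    for candidate in candidate_counts:
        increment = 1.0
        for item in candidate:
            increment *= transaction.get(item, 0.0)
        candidate_counts[candidate] += increment   # the increment may be insignificant

# Transaction 1 from the slide: items 1 (90%), 2 (80%), 4 (5%), 5 (60%), 8 (0.2%)
t1 = {1: 0.9, 2: 0.8, 4: 0.05, 5: 0.6, 8: 0.002}
counts = {frozenset(c): 0.0 for c in [(1, 2), (1, 5), (1, 8), (4, 5), (4, 8)]}
update_candidates(counts, t1)
# counts: {1,2} -> 0.72, {1,5} -> 0.54, {1,8} -> 0.0018, {4,5} -> 0.03, {4,8} -> 0.0001
```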
18
The Apriori Algorithm
Subset Function
Transaction 1: item 1 (90%), item 2 (80%), item 4 (5%),
item 5 (60%), item 8 (0.2%), ..., item 991 (95%)
Candidate Itemset   Expected Support Count
{1, 2}              0.72
{1, 5}              0.54
{1, 8}              0.0018
{4, 5}              0.03
{4, 8}              0.0001
Even for an insignificant support increment such as the
0.0001 for {4, 8}, U-Apriori still has to traverse the hash
tree, find the corresponding entry {4, 8}, and increment its
expected support count.
19
Previous work: The Data Trimming Framework
  • Avoid incrementing those insignificant expected
    support counts.

20
Data Trimming Framework
Trimming phase
Trimming threshold: 10%
Original dataset (existential probabilities):
       I1    I2
t1     90%   80%
t2     80%   4%
t3     2%    5%
t4     5%    95%
t5     94%   95%
Statistics kept by the Trimming Module:
       Total expected support count trimmed   Maximum existential probability trimmed
I1     1.1                                    5%
I2     1.2                                    3%
Trimmed dataset:
       I1    I2
t1     90%   80%
t2     80%   -
t4     -     95%
t5     94%   95%
The uncertain dataset is first trimmed to remove
the items with low existential probability.
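A rough Python sketch of the trimming step (illustrative only; the 10%
threshold and the two per-item statistics follow the example above,
while the paper's exact bookkeeping may differ):

```python
def trim_dataset(dataset, threshold=0.10):
    """Remove items whose existential probability is below the threshold and keep,
    per item, the total expected support trimmed and the maximum probability trimmed."""
    trimmed, stats = [], {}            # stats: item -> (total trimmed, max trimmed)
    for transaction in dataset:        # transaction: {item: existential probability}
        kept = {}
        for item, p in transaction.items():
            if p >= threshold:
                kept[item] = p
            else:
                total, maximum = stats.get(item, (0.0, 0.0))
                stats[item] = (total + p, max(maximum, p))
        if kept:
            trimmed.append(kept)
    return trimmed, stats

# The five-transaction example above, with probabilities written as fractions
original = [
    {"I1": 0.90, "I2": 0.80},
    {"I1": 0.80, "I2": 0.04},
    {"I1": 0.02, "I2": 0.05},
    {"I1": 0.05, "I2": 0.95},
    {"I1": 0.94, "I2": 0.95},
]
trimmed_dataset, trimming_stats = trim_dataset(original)
```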
21
Data Trimming Framework
Trimming phase
Mining phase
Kth - iteration
Pruning Module
Statistics
Potentially Frequent k-itemsets
Trimming Module
Infrequent k-itemsets
Trimmed Dataset
Uncertain Apriori
The uncertain database is first trimmed to remove
the items with low existential probability.
Mine the trimmed dataset.
22
Data Trimming Framework
Trimming phase
Mining phase
Patch-up phase
Kth - iteration
Pruning Module
Statistics
Potentially frequent itemsets
Frequent Itemsets in the original dataset
Patch Up Module
Potentially Frequent k-itemsets
Trimming Module
Infrequent k-itemsets
Frequent itemsets in the trimmed dataset
Trimmed Dataset
Uncertain Apriori
The uncertain database is first trimmed to remove
the items with low existential probability.
The patch-up phase recovers the true frequent
itemsets that are missed when mining the trimmed
dataset.
Mine the trimmed dataset.
23
Data Trimming Framework
Trimming phase
Mining phase
Patch-up phase
Kth - iteration
Pruning Module
Statistics
Potentially frequent itemsets
Frequent Itemsets in the original dataset
OVERHEAD
SAVING
OVERHEAD
Patch Up Module
Potentially Frequent k-itemsets
Trimming Module
Infrequent k-itemsets
Frequent itemsets in the trimmed dataset
Trimmed Dataset
Uncertain Apriori
The performance of the trimming framework is
sensitive to the trimming threshold, which is
hard to determine.
The saving can be counterproductive if there are few
low-probability items.
The patch-up phase has to scan the original dataset at
least once, and scanning the original dataset is an
expensive operation.
24
Data Trimming Framework
  • The main drawback of Data Trimming is that its
    performance is sensitive to the percentage of low
    probability items.

When there are few low probability items in the
dataset, data trimming can be counterproductive.
25
Decremental Pruning
  • Relatively insensitive to the percentage of low
    probability items in the dataset

26
Decremental Pruning
Subset Function
Tid / Item a b c d
t1 1 0.5 0.3 0.2
t2 0.9 0.8 0.7 0.4
t3 0.3 0 0.9 0.7
t4 0.4 0.8 0.3 0.7
Large itemsets
Candidates
Apriori-Gen
  • The decremental pruning technique achieves
    candidate reduction during the mining process.
  • Estimates the upper bounds of expected supports
    of the candidate itemsets progressively after
    each dataset transaction is processed.

27
Decremental Pruning
The beginning of the 2nd iteration
Subset Function
Scan
Tid / Item a b c d
t1 1 0.5 0.3 0.2
t2 0.9 0.8 0.7 0.4
t3 0.3 0 0.9 0.7
t4 0.4 0.8 0.3 0.7
Large itemsets
Candidates
Apriori-Gen
Minimum expected support count: 2
Candidate: {a, b}
The U-Apriori algorithm scans the entire dataset
to obtain the expected support of candidate
itemset {a, b} before it can be identified as
infrequent.
28
Decremental Pruning
The beginning of the 2nd iteration
Subset Function
Scan
Tid / Item a b c d
t1 1 0.5 0.3 0.2
t2 0.9 0.8 0.7 0.4
t3 0.3 0 0.9 0.7
t4 0.4 0.8 0.3 0.7
Large itemsets
Candidates
Apriori-Gen
Minimum expected support count: 2
In the decremental pruning technique, we
progressively estimate an upper bound of
Se({a, b}). Before any transaction is scanned, the
upper bound equals the expected support of the
singleton {a}: 1 + 0.9 + 0.3 + 0.4 = 2.6.
Candidate: {a, b}
Upper bound of Se({a, b}):
Before any transaction is processed: 2.6
In this estimate, we are assuming that the
existential probability of singleton b is
100% in all transactions.
29
Decremental Pruning
The beginning of the 2nd iteration
Subset Function
Tid / Item a b c d
t1 1 0.5 0.3 0.2
t2 0.9 0.8 0.7 0.4
t3 0.3 0 0.9 0.7
t4 0.4 0.8 0.3 0.7
Large itemsets
Candidates
Apriori-Gen
Minimum expected support count: 2
After scanning t1, we know that the existential
probability of b in t1 is 0.5. Previously we assumed
it was 1, so we can deduce that the upper bound of
Se({a, b}) has been overestimated by 1 × (1 - 0.5) = 0.5.
Candidate: {a, b}
Upper bound of Se({a, b}):
Before any transaction is processed: 2.6
After t1 is processed: 2.6 - (1 × (1 - 0.5)) = 2.1
30
Decremental Pruning
The beginning of the 2nd iteration
Subset Function
Tid / Item a b c d
t1 1 0.5 0.3 0.2
t2 0.9 0.8 0.7 0.4
t3 0.3 0 0.9 0.7
t4 0.4 0.8 0.3 0.7
Large itemsets
Candidates
Apriori-Gen
Minimum expected support count: 2
Since the existential probability of singleton b in
transaction t2 is 0.8, we have overestimated the
existential probability of b by 1 - 0.8 = 0.2.
Therefore, the upper bound of Se({a, b}) is
overestimated by 0.9 × 0.2 = 0.18.
Candidate: {a, b}
Upper bound of Se({a, b}):
Before any transaction is processed: 2.6
After t1 is processed: 2.6 - (1 × (1 - 0.5)) = 2.1
After t2 is processed: 2.1 - (0.9 × (1 - 0.8)) = 1.92
31
Decremental Pruning
The beginning of the 2nd iteration
Subset Function
Tid / Item a b c d
t1 1 0.5 0.3 0.2
t2 0.9 0.8 0.7 0.4
t3 0.3 0 0.9 0.7
t4 0.4 0.8 0.3 0.7
Large itemsets
Candidates
Apriori-Gen
Minimum expected support count: 2
Candidate: {a, b}
Upper bound of Se({a, b}):
Before any transaction is processed: 2.6
After t1 is processed: 2.6 - (1 × (1 - 0.5)) = 2.1
After t2 is processed: 2.1 - (0.9 × (1 - 0.8)) = 1.92
Now, since the upper bound of Se({a, b}) is less
than the minimum expected support count, {a, b} is
identified as infrequent and can be pruned.
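The running example can be reproduced with a small Python sketch of a
single decremental counter (illustrative; the item names, probabilities
and the minimum expected support count of 2 are those of the slides):

```python
def decremental_upper_bound(dataset, a, b, min_expected_support):
    """Maintain the decremental counter for a in {a, b}: start from Se({a}) (i.e.
    assume P(b) = 1 everywhere) and, after each transaction t, subtract
    Pt(a) * (1 - Pt(b)), the amount by which b's probability was over-estimated."""
    bound = sum(t.get(a, 0.0) for t in dataset)          # Se({a}) = 2.6 in the example
    for t in dataset:
        bound -= t.get(a, 0.0) * (1.0 - t.get(b, 0.0))
        if bound < min_expected_support:
            return bound, True        # {a, b} is infrequent; prune before the scan ends
    return bound, False

data = [
    {"a": 1.0, "b": 0.5, "c": 0.3, "d": 0.2},   # t1
    {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.4},   # t2
    {"a": 0.3, "b": 0.0, "c": 0.9, "d": 0.7},   # t3
    {"a": 0.4, "b": 0.8, "c": 0.3, "d": 0.7},   # t4
]
print(decremental_upper_bound(data, "a", "b", 2.0))      # (1.92, True) after t2
```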


32
Decremental Pruning
The beginning of the 2nd iteration
Tid / Item a b c d
t1 1 0.5 0.3 0.2
t2 0.9 0.8 0.7 0.4
t3 0.3 0 0.9 0.7
t4 0.4 0.8 0.3 0.7
We refer to the upper bound of Se({a, b}) obtained by
assuming that item b has probability 100% from transaction
tk onwards as the decremental counter of a for {a, b}.
Similarly, we can also assume that item a has probability
100% and maintain a decremental counter of b for {a, b}.
Minimum expected support count: 2
Decremental Counters
Counter of a for {a, b} (assume b is 100%):
Before any transaction is processed: 2.6
After t1 is processed: 2.6 - (1 × (1 - 0.5)) = 2.1
After t2 is processed: 2.1 - (0.9 × (1 - 0.8)) = 1.92
Counter of b for {a, b} (assume a is 100%):
Before any transaction is processed: 2.1
After t1 is processed: 2.1 - (0.5 × (1 - 1)) = 2.1
After t2 is processed: 2.1 - (0.8 × (1 - 0.9)) = 2.02
33
Decremental Pruning
Decremental counter initialization and update
process.
Proof.
34
Decremental Pruning
  • The brute-force decremental counter method is
    infeasible because
  • Too many decremental counters to maintain.
  • Ineffective to prune the candidates one by one.

[Figure: the size-2 candidates (e.g. {a,b}, {a,g}, {a,k},
{b,d}, {b,e}, {b,f}, {b,g}, {c,d}, {c,e}, {c,f}, ...) are
organized in a two-level hash tree; each candidate has an
expected support count and its own decremental counters.]
35
Aggregate by singleton method (AS)
  • Aggregate by singleton method (AS)
  • Reduces the number of decremental counters to the
    number of frequent singletons.

Suppose there are three size-2 candidates; there are 6
decremental counters to maintain in total.
[Figure: the decremental counters of the size-2 candidates
are aggregated into singleton decremental counters.]
36
Aggregate by singleton method (AS)
  • Aggregate by singleton method (AS)
  • Reduces the number of decremental counters to the
    number of frequent singletons.
  • If the counter ds(a,k) drops below the minimum
    support requirement, candidates that contain a
    must be infrequent and can be pruned.

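A possible Python sketch of AS-style pruning for size-2 candidates
follows. The exact bookkeeping here, subtracting per transaction the
smallest over-estimate among a's candidates so that the single counter
stays an upper bound of all of them, is an assumption of this sketch
rather than the paper's exact formulation.

```python
def as_prune(dataset, candidates, min_expected_support):
    """One decremental counter per singleton prefix a; if it drops below the minimum
    expected support, every candidate {a, x} is infrequent and can be pruned."""
    partners = {}                                   # a -> set of partner items x
    for a, x in candidates:                         # candidates given as ordered pairs
        partners.setdefault(a, set()).add(x)
    # Initialise each counter with Se({a}), i.e. assume every partner has probability 1.
    counters = {a: sum(t.get(a, 0.0) for t in dataset) for a in partners}
    pruned = set()
    for t in dataset:
        for a, xs in partners.items():
            if a in pruned:
                continue
            best = max(t.get(x, 0.0) for x in xs)   # smallest possible over-estimate
            counters[a] -= t.get(a, 0.0) * (1.0 - best)
            if counters[a] < min_expected_support:
                pruned.add(a)                       # prune all candidates containing a
    return counters, pruned
```

With the four-transaction example above and candidates ("a", "b") and
("a", "c"), the shared counter for a falls from 2.6 to 1.92 after t2,
so both candidates are pruned at once.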
37
Common prefix method (CP)
  • Another method is to aggregate the decremental
    counters that share a common prefix.
  • We only keep those decremental counters d(X', X)
    where X' is a proper prefix of the candidate X.
  • We use a prefix decremental counter as an upper
    bound of ALL the decremental counters of the
    candidates sharing that prefix.

[Figure: candidates grouped by common prefix ({a,b} and
{a,c} under prefix a, {b,c} and {b,d} under prefix b), with
one prefix decremental counter per group.]
38
Common prefix method (CP)
  • Another method is to aggregate the decremental
    counters that share a common prefix (as above).
  • Many frequent itemset mining algorithms use a
    prefix-tree-like data structure to organize
    candidates.
[Figure: candidates grouped by common prefix ({a,b}, {a,c};
{b,c}, {b,d}; {c,e}; ...), organized in a prefix tree.]
39
Common prefix method (CP)
  • Candidates with a common prefix are organized under
    the same subtree of a prefix tree (hash tree).
  • If the value of a prefix decremental counter
    drops below the minimum support requirement, all
    the candidates in the corresponding subtree can
    be pruned.

[Figure: a prefix tree of candidates; when the prefix
decremental counter of a subtree drops below the minimum
support requirement, the whole subtree of candidates is
crossed out.]
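A sketch of CP-style subtree pruning in Python (again a simplification
for illustration: candidates are grouped by their prefix item, and the
decrement is chosen so the prefix counter stays an upper bound of every
counter in its subtree; a real implementation would hang these counters
on the candidate hash-tree nodes):

```python
def cp_prune(dataset, candidates, min_expected_support):
    """Group size-2 candidates by prefix; one prefix decremental counter per subtree.
    When a counter drops below the minimum expected support, the whole subtree goes."""
    subtrees = {}                                   # prefix item -> set of suffix items
    for prefix, suffix in candidates:
        subtrees.setdefault(prefix, set()).add(suffix)
    counters = {p: sum(t.get(p, 0.0) for t in dataset) for p in subtrees}
    for t in dataset:
        for prefix in list(subtrees):
            suffixes = subtrees[prefix]
            over = 1.0 - max(t.get(s, 0.0) for s in suffixes)
            counters[prefix] -= t.get(prefix, 0.0) * over
            if counters[prefix] < min_expected_support:
                del subtrees[prefix]                # prune every candidate in the subtree
    return subtrees                                 # surviving candidates, by prefix
```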

40
Common prefix method (CP)
  • To improve the pruning power of the CP method, note
    that an item earlier in the item ordering is the
    prefix of more candidates.

Size-2 candidates: {a,b} {a,c} {a,d}; {b,c} {b,d}; {c,d}
Item a is the prefix of 3 candidates: if its prefix
decremental counter drops below the minimum support count,
we can prune 3 itemsets.
Item c is the prefix of only 1 candidate: if its counter
drops below the minimum support count, we can prune only
1 itemset.
41
Common prefix method (CP)
  • Our strategy is to reorder the items by their
    expected supports in ascending order.
  • The prefix decremental counters of items earlier in
    this ordering (those with smaller expected supports)
    are more likely to drop below the minimum support
    threshold, so larger subtrees of candidates can be
    pruned.

42
Details of the aggregation methods
Initialization and update of singleton
decremental counter
Initialization and update of prefix decremental
counter
43
Experiments and Discussions
44
Synthetic datasets
Step 1: Generate data without uncertainty using the IBM
Synthetic Datasets Generator.
  Average length of each transaction (T = 25)
  Average length of frequent patterns (I = 6)
  Number of transactions (D = 100K)
TID  Items
1    2, 4, 9
2    5, 4, 10
3    1, 6, 7
...
Step 2: Introduce existential uncertainty to each item in
the generated dataset (Data Uncertainty Simulator).
  • Assign relatively high probabilities to the items in
    the generated dataset: Normal distribution
    (mean = 75%, standard deviation = 15%).
  • Assign more items with relatively low probabilities to
    each transaction: Normal distribution (mean = 25%,
    standard deviation = 15%).
  • The proportion of items with low probabilities is
    controlled by the parameter R (R = 75%).
TID  Items
1    2 (90%), 4 (80%), 9 (30%), 10 (4%), 19 (25%)
2    5 (75%), 4 (68%), 10 (100%), 14 (15%), 19 (23%)
3    1 (88%), 6 (95%), 7 (98%), 13 (2%), 18 (7%), 22 (10%), 25 (6%)
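A rough Python sketch of the uncertainty simulator described above; the
clamping of the drawn values to [0, 1], the ids given to the inserted
low-probability items, and the exact way R is enforced are assumptions
made for illustration.

```python
import random

def add_uncertainty(transactions, r=0.75, hi=(0.75, 0.15), lo=(0.25, 0.15)):
    """Give existing items 'high' probabilities ~ Normal(75%, 15%) and append extra
    low-probability items ~ Normal(25%, 15%) so that a fraction r of each
    transaction's items are low-probability ones."""
    def clamp(p):
        return min(1.0, max(0.0, p))
    next_item = max(i for t in transactions for i in t) + 1
    uncertain = []
    for items in transactions:
        probs = {i: clamp(random.gauss(*hi)) for i in items}
        n_low = round(len(items) * r / (1.0 - r))   # low items make up r of the total
        for _ in range(n_low):
            probs[next_item] = clamp(random.gauss(*lo))
            next_item += 1
        uncertain.append(probs)
    return uncertain

uncertain_dataset = add_uncertainty([[2, 4, 9], [5, 4, 10], [1, 6, 7]])
```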
45
Performance of Decremental Pruning
Both AS and CP reduce the candidate set in a
progressive way. CP reduces the candidate set by
about 20% after 60% of the transactions are
processed. The pruning power of CP is higher than
that of AS.
46
Performance of Decremental Pruning
The 2nd iteration is the computational
bottleneck. Although CP pruned twice as much as
AS, the CPU cost saving is not doubled because CP
requires a more complex recursive strategy to
maintain the prefix decremental counters.
47
Compare with Data Trimming
The performance of Data Trimming is very
sensitive to R. The performance of the
decremental approaches is relatively stable.
The combined method, which uses both CP and Data
Trimming, strikes a good balance and gives
consistently good performance.
48
Vary minimum support threshold
CP performs slightly better than AS over a wide
range of support thresholds.
A larger support threshold implies a larger minimum
expected support count, so it is easier for the
decremental counters to drop below the required value,
and more candidates can be pruned early.
49
Conclusion
  • We study the problem of mining frequent itemsets
    under a probabilistic framework.
  • Data Trimming technique
  • Works well when there is a substantial number of
    items with low existential probabilities in the
    dataset.
  • Counterproductive when there are few low-probability
    items in the dataset.
  • We proposed the decremental pruning technique
    that achieves candidate reduction during the
    mining process.
  • Experimental results show that the Decremental
    Pruning and Data Trimming techniques can be
    combined to yield the most stable and efficient
    algorithm.

50
Thank you!
51
Appendix
52
Data Trimming Framework
  • Direction
  • Try to avoid incrementing those insignificant
    expected support counts.
  • Save the effort for
  • Traversing the hash tree.
  • Computing the expected support count.
    (Multiplication of float variables)
  • The I/O for retrieving the items with very low
    existential probability.

53
Computational Issue
  • Preliminary experiment to verify the
    computational bottleneck of mining uncertain
    datasets.
  • 7 synthetic datasets with the same frequent itemsets.
  • Vary the percentages of items with low
    existential probability (R) in the datasets.

Dataset:  1    2       3    4    5       6      7
R:        0%   33.33%  50%  60%  66.67%  71.4%  75%
54
Computational Issue
CPU cost in each iteration of different datasets
[Chart: CPU cost per iteration for datasets 1 (R = 0%)
through 7 (R = 75%), x-axis: iterations.]
Although all datasets contain the same frequent
itemsets, U-Apriori requires different amounts of
time to execute.
The dataset with 75% low-probability items has
many insignificant support increments. Those
insignificant support increments may be redundant,
so this gap can potentially be reduced.
55
Data Trimming Framework
Uncertain dataset (existential probabilities):
       I1    I2
t1     90%   80%
t2     80%   4%
t3     2%    5%
t4     5%    95%
t5     94%   95%
Trimmed dataset:
       I1    I2
t1     90%   80%
t2     80%   -
t4     -     95%
t5     94%   95%
Statistics:
       Total expected support count trimmed   Maximum existential probability trimmed
I1     1.1                                    5%
I2     1.2                                    3%
  • Create a trimmed dataset by trimming out all
    items with low existential probabilities.
  • During the trimming process, some statistics are
    kept for error estimation when mining the trimmed
    dataset.
  • Total expected support count being trimmed of
    each item.
  • Maximum existential probability being trimmed of
    each item.
  • Other information, e.g. inverted lists,
    signature files, etc.

56
Data Trimming Framework
The uncertain database is first passed into the
trimming module to remove the items with low
existential probability and gather statistics
during the trimming process.
Trimming Module
57
Data Trimming Framework
The uncertain database is first passed into the
trimming module to remove the items with low
existential probability and gather statistics
during the trimming process.
Trimming Module
Trimmed Dataset
Uncertain Apriori
The trimmed dataset is then mined by the
Uncertain Apriori algorithm.
58
Data Trimming Framework
The uncertain database is first passed into the
trimming module to remove the items with low
existential probability and gather statistics
during the trimming process.
Trimming Module
Infrequent k-itemsets
Trimmed Dataset
Uncertain Apriori
Notice that the infrequent itemsets pruned by
the Uncertain Apriori algorithm are only
infrequent in the trimmed dataset.
The trimmed dataset is then mined by the
Uncertain Apriori algorithm.
59
Data Trimming Framework
The uncertain database is first passed into the
trimming module to remove the items with low
existential probability and gather statistics
during the trimming process.
The pruning module uses the statistics gathered
from the trimming module to identify the itemsets
which are infrequent in the original dataset.
Pruning Module
Statistics
Trimming Module
Infrequent k-itemsets
Trimmed Dataset
Uncertain Apriori
Notice that the infrequent itemsets pruned by
the Uncertain Apriori algorithm are only
infrequent in the trimmed dataset.
The trimmed dataset is then mined by the
Uncertain Apriori algorithm.
60
Data Trimming Framework
The pruning module uses the statistics gathered
from the trimming module to identify the itemsets
which are infrequent in the original dataset.
The uncertain database is first passed into the
trimming module to remove the items with low
existential probability and gather statistics
during the trimming process.
Kth - iteration
Pruning Module
Statistics
The potentially frequent itemsets are passed back
to the Uncertain Apriori algorithm to generate
candidates for the next iteration.
Potentially Frequent k-itemsets
Trimming Module
Infrequent k-itemsets
Trimmed Dataset
Uncertain Apriori
Notice that the infrequent itemsets pruned by
the Uncertain Apriori algorithm are only
infrequent in the trimmed dataset.
The trimmed dataset is then mined by the
Uncertain Apriori algorithm.
61
Data Trimming Framework
The pruning module uses the statistics gathered
from the trimming module to identify the itemsets
which are infrequent in the original dataset.
The uncertain database is first passed into the
trimming module to remove the items with low
existential probability and gather statistics
during the trimming process.
Kth - iteration
Pruning Module
Statistics
Potentially frequent itemsets
Frequent Itemsets in the original dataset
Patch Up Module
Potentially Frequent k-itemsets
Trimming Module
Infrequent k-itemsets
Frequent itemsets in the trimmed dataset
Trimmed Dataset
Uncertain Apriori
The trimmed dataset is then mined by the
Uncertain Apriori algorithm.
The potentially frequent itemsets are verified by
the patch up module against the original dataset.
62
Data Trimming Framework
Kth - iteration
Pruning Module
Statistics
Potentially frequent itemsets
Frequent Itemsets in the original dataset
Patch Up Module
Potentially Frequent k-itemsets
Trimming Module
Infrequent k-itemsets
Frequent itemsets in the trimmed dataset
Trimmed Dataset
Uncertain Apriori
63
Data Trimming Framework
There are three modules under the data trimming
framework, and each module can have different
strategies.
What statistics are used in the pruning strategy?
Kth - iteration
Pruning Module
Statistics
Potentially frequent itemsets
Frequent Itemsets in the original dataset
Patch Up Module
Potentially Frequent k-itemsets
Trimming Module
Infrequent k-itemsets
Frequent itemsets in the trimmed dataset
Trimmed Dataset
Uncertain Apriori
Can we use a single scan to verify all the
potentially frequent itemsets or multiple scans
over the original dataset?
Is the trimming threshold global to all items or
local to each item?
64
Data Trimming Framework
There are three modules under the data trimming
framework, and each module can have different
strategies.
Kth - iteration
Pruning Module
Statistics
Potentially frequent itemsets
Frequent Itemsets in the original dataset
Patch Up Module
Potentially Frequent k-itemsets
Trimming Module
Infrequent k-itemsets
Frequent itemsets in the trimmed dataset
Trimmed Dataset
Uncertain Apriori
  • To what extent do we trim the dataset?
  • If we trim too little, the computational cost
    saved cannot compensate for the overhead.
  • If we trim too much, mining the trimmed dataset
    will miss many frequent itemsets, pushing the
    workload to the patch up module.

65
Data Trimming Framework
  • The role of the pruning module is to estimate the
    error of mining the trimmed dataset.
  • Bounding techniques should be applied here to
    estimate the upper bound and/or lower bound of
    the true expected support of each candidate.

There are three modules under the data trimming
framework, and each module can have different
strategies.
Kth - iteration
Pruning Module
Statistics
Potentially frequent itemsets
Frequent Itemsets in the original dataset
Patch Up Module
Potentially Frequent k-itemsets
Trimming Module
Infrequent k-itemsets
Frequent itemsets in the trimmed dataset
Trimmed Dataset
Uncertain Apriori
  • To what extent do we trim the dataset?
  • If we trim too little, the computational cost
    saved cannot compensate for the overhead.
  • If we trim too much, mining the trimmed dataset
    will miss many frequent itemsets, pushing the
    workload to the patch up module.

66
Data Trimming Framework
  • The role of the pruning module is to estimate the
    error of mining the trimmed dataset.
  • Bounding techniques should be applied here to
    estimate the upper bound and/or lower bound of
    the true expected support of each candidate.

There are three modules under the data trimming
framework, and each module can have different
strategies.
Kth - iteration
Pruning Module
Statistics
Potentially frequent itemsets
Frequent Itemsets in the original dataset
Patch Up Module
Potentially Frequent k-itemsets
Trimming Module
Infrequent k-itemsets
Frequent itemsets in the trimmed dataset
Trimmed Dataset
Uncertain Apriori
  • To what extent do we trim the dataset?
  • If we trim too little, the computational cost
    saved cannot compensate for the overhead.
  • If we trim too much, mining the trimmed dataset
    will miss many frequent itemsets, pushing the
    workload to the patch up module.
  • We try to adopt a single-scan patch up strategy
    so as to save the I/O cost of scanning the
    original dataset.
  • To achieve this strategy, the potentially frequent
    itemsets output by the pruning module should contain
    all the true frequent itemsets missed in the mining
    process.