Title: Data Mining Tutorial
Data Mining Tutorial
- Tomasz Imielinski
- Rutgers University
What is data mining?
- Finding interesting, useful, unexpected patterns
- Finding patterns, clusters, associations, classifications
- Answering inductive queries
- Aggregations and their changes on multidimensional cubes
Table of Contents
- Association Rules
- Interesting Rules
- OLAP
- Cubegrades: a unification of association rules and OLAP
- Classification and clustering methods are not included in this tutorial
Association Rules
- AIS 1993: Agrawal, Imielinski, Swami, Mining Association Rules Between Sets of Items in Large Databases, SIGMOD 1993
- AS 1994: Agrawal, Srikant, Fast Algorithms for Mining Association Rules in Large Databases, VLDB 1994
- B 1998: Bayardo, Efficiently Mining Long Patterns from Databases, SIGMOD 1998
- SA 1996: Srikant, Agrawal, Mining Quantitative Association Rules in Large Relational Tables, SIGMOD 1996
- T 1996: Toivonen, Sampling Large Databases for Association Rules, VLDB 1996
- BMS 1997: Brin, Motwani, Silverstein, Beyond Market Baskets: Generalizing Association Rules to Correlations, SIGMOD 1997
- IV 1999: Imielinski, Virmani, MSQL: A Query Language for Database Mining, DMKD 1999
Baskets
- I1, ..., Im: a set of (binary) attributes called items
- T is a database of transactions
- t[k] = 1 if transaction t bought item k
- Association rule X => I, with support s and confidence c
- Support: what fraction of T satisfies X and I
- Confidence: what fraction of the transactions satisfying X also satisfy I
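These two definitions can be computed directly; a minimal sketch in Python, with an invented toy basket database:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, body, head):
    """Of the transactions satisfying the body, the fraction also satisfying the head."""
    return support(transactions, set(body) | set(head)) / support(transactions, body)

# invented toy basket database
T = [{"milk", "bread"}, {"milk", "cereal", "bread"},
     {"cereal"}, {"milk", "cereal"}]
print(support(T, {"milk", "bread"}))        # 0.5
print(confidence(T, {"milk"}, {"bread"}))
```

Support is measured against the whole database, while confidence is conditional on the body, which is why the two thresholds prune very differently.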
Baskets
- Minsup, Minconf
- Frequent sets: sets of items X such that their support sup(X) > minsup
- If X is frequent, all its subsets are frequent (closure downwards)
Examples
- 20% of transactions which bought cereal and milk also bought bread (support 2%)
- Worst case: an exponential number (in the size of the set of items) of such rules
- What kind of set of transactions leads to an exponential blow-up of the rule set?
- Fortunately, the worst cases are unlikely and not typical. Support provides excellent pruning ability.
General Strategy
- Generate frequent sets
- Get association rules X => I with their support and confidence: s = support(X and I), and confidence c = support(X and I) / support(X)
- Key property: downward closure of the frequent sets; we don't have to consider supersets of X if X is not frequent
General strategies
- Make repetitive passes through the database of transactions
- In each pass, count the support of CANDIDATE frequent sets
- In the next pass, continue with the frequent sets obtained so far by expanding them. Do not expand sets which were determined NOT to be frequent.
AIS Algorithm
(R. Agrawal, T. Imielinski, A. Swami, Mining
Association Rules Between Sets of Items in Large
Databases, SIGMOD93)
AIS: generating association rules
(R. Agrawal, T. Imielinski, A. Swami, Mining
Association Rules Between Sets of Items in Large
Databases, SIGMOD93)
AIS: estimation part
(R. Agrawal, T. Imielinski, A. Swami, Mining
Association Rules Between Sets of Items in Large
Databases, SIGMOD93)
Apriori
(R. Agrawal, R Srikant, Fast Algorithms for
Mining Association Rules, VLDB94)
Apriori algorithm
(R. Agrawal, R Srikant, Fast Algorithms for
Mining Association Rules, VLDB94)
Pruning in Apriori through self-join
(R. Agrawal, R Srikant, Fast Algorithms for
Mining Association Rules, VLDB94)
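The pass structure and self-join pruning referenced on this slide (whose original figures are lost) can be sketched in Python. This is a minimal in-memory sketch, not the paper's implementation; the toy transactions are invented:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent-itemset mining with self-join candidate generation."""
    n = len(transactions)
    level = [(i,) for i in sorted({i for t in transactions for i in t})]
    freq = {}
    while level:
        # one database pass: count every candidate of the current size
        counts = {c: sum(set(c) <= t for t in transactions) for c in level}
        frequent = {c for c, k in counts.items() if k / n >= minsup}
        freq.update({c: counts[c] / n for c in frequent})
        # self-join: merge sorted k-itemsets sharing a (k-1)-prefix, then
        # prune candidates having an infrequent k-subset (downward closure)
        level = []
        for a, b in combinations(sorted(frequent), 2):
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                if all(s in frequent for s in combinations(cand, len(cand) - 1)):
                    level.append(cand)
    return freq

T = [{"milk", "bread"}, {"milk", "cereal", "bread"},
     {"cereal"}, {"milk", "cereal"}]
print(sorted(apriori(T, 0.5)))
```

The self-join of two frequent k-sets sharing a (k-1)-prefix, followed by the all-subsets check, is exactly the candidate pruning that makes the level-wise search feasible.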
Performance improvement due to Apriori pruning
(R. Agrawal, R Srikant, Fast Algorithms for
Mining Association Rules, VLDB94)
Other pruning techniques
- Key question: at any point in time, how do we determine which extensions of a given candidate set are worth counting?
- Apriori: only those for which all subsets are frequent
- Only those for which the estimated upper bound of the count is above minsup
- Take a risk: count a large superset of the given candidate set. If it is frequent, then all its subsets are also frequent, which is a saving. If not, at least we have pruned all its supersets.
Jump-ahead schemes: Bayardo's Max-Miner
(R. Bayardo, Efficiently Mining Long Patterns
from Databases, SIGMOD98)
Jump-ahead scheme
- h(g) and t(g): head and tail of an item group. The tail is the maximal set of items with which g can possibly be extended.
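The jump-ahead idea can be sketched minimally (toy data; the head and tail of the group are passed in explicitly here): count the long set h(g) ∪ t(g) first, and if it is frequent, every one of its subsets is frequent without further counting.

```python
def long_set_is_frequent(transactions, head, tail, minsup):
    """Count h(g) | t(g) in one shot; if frequent, all its subsets are too."""
    g = set(head) | set(tail)
    sup = sum(g <= t for t in transactions) / len(transactions)
    return sup >= minsup

# invented toy data: {a, b, c} occurs in half the transactions
T = [{"a", "b", "c"}, {"a", "b", "c"}, {"a"}, {"b"}]
print(long_set_is_frequent(T, {"a"}, {"b", "c"}, 0.5))   # True
```

When the check succeeds, the entire subtree of subsets of h(g) ∪ t(g) is skipped, which is what makes the scheme pay off on long patterns.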
Max-miner
(R. Bayardo, Efficiently Mining Long Patterns
from Databases, SIGMOD98)
Max-miner vs Apriori vs Apriori-LB
- Max-miner is over two orders of magnitude faster than Apriori in identifying maximal frequent patterns on data sets with long maximal patterns
- Considers fewer candidate sets
- Indexes only on head items
- Dynamic item reordering
Quantitative Rules
- Rules which involve continuous/quantitative attributes
- Standard approach: discretize into intervals
- Problem: the discretization is arbitrary, so we will miss rules
- MinSup problem: if the number of intervals is large, their support will be low
- MinConf problem: if intervals are large, rules may not meet minimum confidence
Correlation Rules [BMS 1997]
- Suppose the conditional probability that a customer buys coffee given that he buys tea is 80%. Is this an important/interesting rule?
- It depends: if the a priori probability of a customer buying coffee is 90%, then it is not
- Need 2x2 contingency tables rather than just pure association rules: a chi-square test for correlation, rather than the support/confidence framework alone, which can be misleading
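The tea/coffee point can be made concrete with a 2x2 contingency table. A minimal sketch; the table counts are invented but chosen to match the 80%/90% figures above (25 of 100 customers buy tea, 20 of those buy coffee, 90 buy coffee in total):

```python
def chi2_2x2(n11, n10, n01, n00):
    """Chi-squared statistic for a 2x2 contingency table."""
    n = n11 + n10 + n01 + n00
    r1, r0 = n11 + n10, n01 + n00        # row totals (tea, not-tea)
    c1, c0 = n11 + n01, n10 + n00        # column totals (coffee, not-coffee)
    stat = 0.0
    for obs, row, col in [(n11, r1, c1), (n10, r1, c0),
                          (n01, r0, c1), (n00, r0, c0)]:
        exp = row * col / n              # expected count under independence
        stat += (obs - exp) ** 2 / exp
    return stat

# tea&coffee=20, tea only=5, coffee only=70, neither=5
stat = chi2_2x2(20, 5, 70, 5)
print(round(stat, 2))   # 3.7
```

The statistic (about 3.70) falls just below the 3.84 cutoff for one degree of freedom at the 5% level, so despite the 80% confidence, the dependence between tea and coffee is not significant here.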
Correlation Rules
- Events A and B are independent if p(AB) = p(A) x p(B)
- If any of AB, A(not B), (not A)B, (not A)(not B) are dependent, then A and B are correlated; likewise, for three items, if any of the eight combinations of A, B and C are dependent, then A, B, C are correlated
- I = {i1, ..., in} is a correlation rule iff the occurrences of i1, ..., in are correlated
- Correlation is upward closed: if S is correlated, so is any superset of S
Downward vs upward closure
- Downward closure (frequent sets) is a pruning property
- Upward closure: minimal correlated itemsets, such that no subsets of them are correlated. Then finding a correlation is a pruning step: prune all the parents of a correlated itemset, because they are not minimal.
- Border of correlation
Pruning based on support-correlation
- Correlation can be an additional pruning criterion next to support
- Unlike support/confidence, where confidence is not upward closed
Chi-square
(S. Brin, R. Motwani, C. Silverstein, Beyond
Market Baskets Generalizing Association Rules to
Correlations, SIGMOD97)
Correlation Rules
(S. Brin, R. Motwani, C. Silverstein, Beyond
Market Baskets Generalizing Association Rules to
Correlations, SIGMOD97)
Algorithms for Correlation Rules
- The border can be large, exponential in the size of the item set; we need better pruning functions
- A support function needs to be defined, but also for negative dependencies
- A set of items S has support s at the p level if at least p% of the cells in the contingency table for S have value at least s
- Problem: for p < 50%, all items have support at the level one
- For p > 25%, at least two cells in the contingency table will have support s
Pruning
- Antisupport (for rare events)
- Prune itemsets with a very high chi-square value to eliminate obvious correlations
- Combine chi-squared correlation rules with pruning via support
- An itemset is significant iff it is supported and minimally correlated
Algorithm χ²-support
- INPUT: a chi-squared significance level alpha, support s, support fraction p > 0.25, and basket data B.
- OUTPUT: a set of minimal correlated itemsets from B.
- 1. For each item i, count O(i). We can use these values to calculate any necessary expected value.
- 2. Initialize CAND, SIG and NOTSIG to be empty.
- 3. For each pair of items i, j with sufficient singleton counts, add {i, j} to CAND.
- 4. If CAND is empty, then return SIG and terminate.
- 5. For each itemset in CAND, construct the contingency table for the itemset. If less than p percent of the cells have count at least s, then go to Step 7.
- 6. If the chi-squared value for the contingency table is at least the cutoff for significance level alpha, then add the itemset to SIG, else add the itemset to NOTSIG.
- 7. Continue with the next itemset in CAND. If there are no more itemsets in CAND, then set CAND to be the set of all sets S such that every subset of S of size |S| - 1 is in NOTSIG.
- 8. Go to Step 4.
(S. Brin, R. Motwani, C. Silverstein, Beyond Market Baskets: Generalizing Association Rules to Correlations, SIGMOD97)
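The candidate-regeneration step, where the next level of candidates consists of every set all of whose immediate subsets were found uncorrelated, can be sketched as follows (a minimal sketch; here NOTSIG is simply a set of frozensets found not significantly correlated):

```python
from itertools import combinations

def next_candidates(notsig, k):
    """Itemsets of size k+1 all of whose size-k subsets are in NOTSIG."""
    items = sorted({i for s in notsig for i in s})
    return [frozenset(c) for c in combinations(items, k + 1)
            if all(frozenset(sub) in notsig for sub in combinations(c, k))]

# all three pairs over {a, b, c} are uncorrelated, so the triple is a candidate
notsig = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c")]}
print(next_candidates(notsig, 2))
```

This mirrors the upward-closure pruning: once an itemset is correlated, none of its supersets is ever generated, which keeps only the minimal correlated itemsets.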
Sampling Large Databases for Association Rules [T 1996]
- Pick a random sample
- Find all association rules which hold in that sample
- Verify the results with the rest of the database
- Missing rules can be found in a second pass
Key idea in more detail
- Find a collection of frequent sets in the sample using a lower support threshold. This collection is likely to be a superset of the frequent sets in the entire database.
- Concept of the negative border: the minimal sets which are not in a set collection S
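The negative border can be computed directly from its definition; a minimal sketch over an explicit item universe (item names and the sample collection are invented):

```python
from itertools import combinations

def negative_border(items, frequent):
    """Minimal itemsets outside `frequent` whose every proper subset is frequent."""
    frequent = {frozenset(f) for f in frequent}
    frequent.add(frozenset())          # the empty set is trivially frequent
    border = []
    for k in range(1, len(items) + 1):
        for cand in combinations(sorted(items), k):
            fs = frozenset(cand)
            if fs not in frequent and all(frozenset(s) in frequent
                                          for s in combinations(cand, k - 1)):
                border.append(fs)
    return border

sample_frequent = [("a",), ("b",), ("a", "b")]
print(negative_border({"a", "b", "c"}, sample_frequent))
```

In Toivonen's scheme these border sets are exactly the extra candidates counted in the full-database pass: if none of them turns out frequent, the sample result is already complete.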
Algorithm
(H. Toivonen, Sampling Large Databases for
Association Rules, VLDB96)
Second pass
- The negative border consists of the "closest" itemsets which could be frequent too
- These have to be tried (measured)
Probability that a sample s has exactly c rows that contain X
(H. Toivonen, Sampling Large Databases for
Association Rules, VLDB96)
Bounding error
(H. Toivonen, Sampling Large Databases for
Association Rules, VLDB96)
Approximate mining
(H. Toivonen, Sampling Large Databases for
Association Rules, VLDB96)
Summary
- Discover all frequent sets in one pass in a fraction 1 - D of the cases, where D is given by the user; missing sets may be found in a second pass
Rules, and what's next?
- Querying rules
- Embedding rules in applications (API)
MSQL
(T. Imielinski, A. Virmani, MSQL A Query
Language for Database Mining, Data Mining and
Knowledge Discovery 3, 99)
Applications with embedded rules (what are rules good for)
- Typicality
- Characteristic of
- Changing patterns
- Best N
- What if
- Prediction
- Classification
OLAP
- Multidimensional queries
- Dimensions
- Measures
- Cubes
Data CUBE
(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman,
D. Reichart, M. Vankatrao, Data Cube A
Relational Aggregation Operator Generalizing
Group-By, Cross-Tab, and Sub-Totals, Data Mining
and Knowledge Discovery 1, 1997)
Measure Properties
(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman,
D. Reichart, M. Vankatrao, Data Cube A
Relational Aggregation Operator Generalizing
Group-By, Cross-Tab, and Sub-Totals, Data Mining
and Knowledge Discovery 1, 1997)
Monotonicity
- Iceberg queries
- COUNT, MAX, SUM, etc. allow pruning
- AVG does not: the AVG of a cube extension can be larger or smaller than the AVG over the original cube, so there is no pruning in the Apriori sense
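A tiny numeric illustration (values invented) of why an AVG threshold cannot prune: shrinking a cube can push AVG in either direction.

```python
avg = lambda xs: sum(xs) / len(xs)

cube = [10, 20, 90]        # AVG = 40.0
up   = [20, 90]            # subcube dropping the low record: AVG = 55.0
down = [10, 20]            # subcube dropping the high record: AVG = 15.0
print(avg(cube), avg(up), avg(down))   # 40.0 55.0 15.0
```

COUNT, by contrast, can only shrink on a subcube, which is exactly what makes support-based pruning sound.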
Examples of Monotonic Conditions
Cubegrades: combining OLAP and association rules
- Consider the rule milk, butter => bread (support 100, confidence 75%).
- Consider it as a gradient, or derivative, of a cube.
- Body: a 2d-cube in multidimensional space representing transactions where milk and butter are bought together.
- Consequent: represents the specialization of the body cube by bread. Body + consequent represents the subcube where milk, butter and bread are bought together.
- Support: COUNT of records in the body cube.
- Confidence: measures how COUNT is affected when we specialize the body cube by the consequent.
Cubegrades: Generalization of Association Rules
- We can generalize this in two ways.
- Allow additional operators for cube transformation, including specializations, generalizations and mutations.
- Allow additional measures such as MIN, MAX, SUM, AVG, etc.
- Result: cubegrades, entities that describe how transforming a source cube X into a target cube Y affects a set of measure values.
Mathematical Similarity
- Similar to a function gradient: it measures how a change in the function argument affects the function value.
- A cubegrade measures how a change in the cube affects measure (function) values.
Using cubegrades: Examples
- Data description: monthly summaries of item sales per customer, plus customer demographics.
- Examples:
- How is the average amount of milk bought affected by different age categories among buyers of cereals?
- What factors cause the average amount of milk bought to increase by more than 25% among suburban buyers?
- How do buyers in rural cubes compare with buyers in suburban cubes in terms of the average amount spent on bread, milk and cereal?
Cubegrade lingo
- Consider the following cube: areaType=urban, Age=[25,35] (Avg(sales(Milk))=25)
- Descriptor: an attribute-value pair.
- K-conjunct: a conjunct of k descriptors.
- Cube: the set of objects in a database that satisfy the k-conjunct.
- Dimensions: the attributes used in the descriptors.
- Measures: attributes that are aggregated over objects.
Cubegrade Definition
- Mathematically, a cubegrade is a 5-tuple <Source, Target, Measures, Values, Delta-Value>
- Source: the source, or initial, cube.
- Target: the target cube obtained by applying a factor F to the source. Target = Source + Factor.
- Measures: the set of measures evaluated.
- Values: a function evaluating a measure in the source.
- Delta-Value: a function evaluating the ratio of a measure value in the target cube versus the measure value in the source cube.
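Under this definition the delta value is just a ratio of a measure across the two cubes; a minimal sketch, with invented rows standing in for the records of each cube:

```python
def delta_value(measure, source_rows, target_rows):
    """Percentage ratio: measure on the target cube vs. on the source cube."""
    return 100.0 * measure(target_rows) / measure(source_rows)

avg = lambda rows: sum(rows) / len(rows)
source = [20, 20, 30]   # e.g. milk sales over all urban buyers (invented)
target = [25, 25]       # the same measure after specializing by an Age descriptor
print(round(delta_value(avg, source, target)))   # 107
```

A delta above 100 means the specialization raised the measure; below 100, it lowered it.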
Cubegrade Example
areaType=urban -> areaType=urban, Age=[25,35]
- Source cube: areaType=urban
- Target cube: areaType=urban, Age=[25,35]
- Measure: Avg(sales(Milk))
- Value: Avg(sales(Milk)) = 25
- Delta Value: DeltaAvg(sales(Milk)) = 125%
Types of cubegrades
- Starting cube: A=a1, B=b1, C=c1
- Specialize by D: add the descriptor D=d1
- Generalize on C: drop the descriptor C=c1
- Mutate C to c2: replace C=c1 with C=c2
Querying cubegrades
- CubeQL (for querying cubes) and CubegradeQL (for querying cubegrades).
- Features:
- SQL-like, declarative style.
- Conditions on the source cube and the target cube.
- Conditions on measure values and delta values.
- Join conditions between source and target.
How, which and what
(A. Abdulgani, Ph.D. Thesis, Rutgers University, 2000)
The Challenge
- Pruning was what made association rules practical.
- Computation was bottom-up: if a cube doesn't satisfy the support threshold, no subcube will satisfy the support threshold.
- COUNT is no longer the sole constraint; there are new, additional constraints.
Assumptions
- Dealing with the SQL aggregate measures MIN, MAX, SUM, AVG.
- Each constraint is of the form AGG(X) > c, AGG(X) < c, or AGG(X) = c, where c is a constant.
Monotonicity
- Consider a query Q, a database D, and a cube X in D.
- Query Q is monotonic if the condition Q(X) is FALSE in D implies that Q(X') is FALSE in D for every subcube X' ⊆ X.
View Monotonicity
- Alternatively, define a cube's view as the projection of the measure and dimension values holding on the cube.
- A view is not tied to a particular cube or database.
- Q is monotonic for view V if, for any cube X in any database D such that V is a view for X, Q(X) is FALSE implies Q(X') is FALSE for every subcube X' ⊆ X.
GBP Sketch
- Grid construction for the input query
- Axes defined on the dimension/measure attributes used in the query.
- Axis intervals based on the constants used in the query.
- The Cartesian product of intervals defines the individual cells.
- Query evaluation for each cell.
(Figure: truth of the query per cell, on a grid with MAX(X) ticks at 0, 50, 150 and AVG(X) ticks at 25, 50)

MAX(X)
   >150   F    T    T
  50-150  T    F    T
   0-50   F    F    T
         <25 25-50 >50   AVG(X)
Checking for satisfiability
- A cell C is defined by
- mL ≤ MIN(A) ≤ mH
- ML ≤ MAX(A) ≤ MH
- AL ≤ AVG(A) ≤ AH
- SL ≤ SUM(A) ≤ SH
- CL ≤ COUNT() ≤ CH
- Reduce to the system
- (N-1)mL + ML ≤ S ≤ (N-1)MH + mH
- SL ≤ S ≤ SH
- AL·N ≤ S ≤ AH·N
- CL ≤ N ≤ CH
- Solve for N and check the interval returned for N.
- For measures on multiple attributes, solve independently for the distinct attributes, then check for a common shared interval for N.
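The reduced system can be checked by scanning the integer range for N and intersecting the interval constraints on the sum S; a minimal sketch, with invented cell bounds:

```python
def cell_satisfiable(mL, mH, ML, MH, AL, AH, SL, SH, CL, CH):
    """Return the smallest feasible count N, or None if the cell is empty."""
    for N in range(max(CL, 1), CH + 1):
        lo = max((N - 1) * mL + ML, SL, AL * N)   # smallest achievable sum
        hi = min((N - 1) * MH + mH, SH, AH * N)   # largest achievable sum
        if lo <= hi:
            return N
    return None

# invented cell bounds: MIN in [0,10], MAX in [0,50], AVG in [20,30],
# SUM in [0,1000], COUNT in [1,19]
print(cell_satisfiable(0, 10, 0, 50, 20, 30, 0, 1000, 1, 19))   # 2
```

The lower bound on S comes from forcing one element up to ML and the rest down to mL, and symmetrically for the upper bound, which is exactly the first inequality of the system.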
View Reachability
- Question: is there a cube X with view V such that X has a subcube which falls in a TRUE cell? Is a TRUE cell C reachable from V?
(Figure: the same truth grid as before, with the view V drawn as a point in the middle band of cells)

MAX(X)
   >150   F    T    T
  50-150  T    F    T      <- V
   0-50   F    F    T
         <25 25-50 >50   AVG(X)
Defining View Reachability
- A cell C is defined by
- mL ≤ MIN(A) ≤ mH
- ML ≤ MAX(A) ≤ MH
- AL ≤ AVG(A) ≤ AH
- SL ≤ SUM(A) ≤ SH
- CL ≤ COUNT() ≤ CH
- A view V is defined by
- MIN(A) = m
- MAX(A) = M
- AVG(A) = a
- SUM(A) = s
- COUNT(A) = c
- Cell C is reachable from view V if there is a set X of real elements X1, X2, ..., Xc which satisfies the view constraints, and a subset X' = X1, X2, ..., XN of X which satisfies the cell constraints.
Checking for View Reachability
- View reachability on measures of a single attribute can be reduced to at most 4 systems with a constant number of linear constraints on N.
- For measures on multiple distinct attributes, obtain a set of intervals on every attribute separately. C is reachable from V if there is a shared interval obtained on N containing an integral point.
Example
- Consider a view of 19 records X = {X1, ..., X19} with MIN(X) = 0, MAX(X) = 75, SUM(X) = 1000.
- Let C be defined by [CL, CH] = [1, 19], [mL, mH] = [0, 10], [ML, MH] = [0, 50], [AL, AH] = [46.5, 50].
- C is reachable from V either with N = 12 or with N = 15.
Complexity Analysis
- Let Q be a query in disjunctive normal form consisting of m conjuncts in J dimensions and K distinct measure attributes.
- The monotonicity of Q for a given view can be tested in O(m(J + K log K)) time.
Computing cubegrades
- Algorithm Cubegrade-Gen-Basic:
- Evaluate Qsource
- For each S in Qsource:
- Evaluate QS
- For each T in QS:
- Form the cubegrade <S, T, Measures, Values, Delta Values>, where the Delta Values have to be calculated as ratios of the measures evaluated on the target and on the source cubes respectively.
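The nested loop above can be sketched as follows, with cube evaluation abstracted into plain callables and a row lookup (all names here are invented for illustration):

```python
def cubegrade_gen_basic(source_cubes, targets_of, measures, rows_of):
    """Pair each source cube with its targets and compute value/delta tuples."""
    grades = []
    for s in source_cubes:                       # for each S in Qsource
        for t in targets_of(s):                  # for each T in QS
            values = {m: f(rows_of(s)) for m, f in measures.items()}
            deltas = {m: f(rows_of(t)) / f(rows_of(s))
                      for m, f in measures.items()}
            grades.append((s, t, values, deltas))
    return grades

rows = {"urban": [10, 20], "urban+young": [30]}
avg = lambda xs: sum(xs) / len(xs)
out = cubegrade_gen_basic(["urban"], lambda s: ["urban+young"],
                          {"avg": avg}, rows.__getitem__)
print(out[0][3])   # {'avg': 2.0}
```

The expensive part in practice is evaluating QS for every source cube, which is where the GBP pruning of the preceding slides pays off.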
Cube and cubegrade query classes
- Cube query classification:
- Queries with strong monotonicity.
- Queries with weak monotonicity.
- Hopeless queries.
- Cubegrade query classification, based on the source cube query classification and the target cube classification:
- Focused.
- Weakly focused.
- Hopeless.
Cubegrade Application Development
- Cubegrades are not end products; rather, they are an investment to drive a set of applications.
- Definition of an application framework for cubegrades. Features include:
- Extension of the Dmajor data mining platform.
- Generation, storage and retrieval of cubegrades.
- Accessing internal components of cubegrades for browsing, comparisons and modifications.
- Traversals through a set of cubegrades.
- Primitives for correlating cubegrades with the underlying data and vice versa.
Application Example: Effective Factors
- Find factors which are effective in changing a measure value m for a collection of cubes by a significant ratio.
- Factor F is effective for a collection C iff for all G = <C', C'+F, m, V, Delta> where C' ∈ C, it holds that Delta(m) > (1 + x) or Delta(m) < (1 - x).
Cubegrades and OLAP
Future work
- Extending GBP to cover additional constraint types.
- Monotonicity threshold of a query.
- Domain-specific application: gene expression mining.
Summary
- The cubegrade concept as a generalization of association rules and cubes.
- The concept of querying cubes and cubegrades.
- Description of the GBP method for efficient pruning of queries with constraints of the type Agg(a) ≥ c or Agg(a) ≤ c, where Agg() can be MIN(), MAX(), SUM() or AVG().
- Experimentally, through a cubegrade engine prototype, shown the viability of GBP and of the cubegrade generation process.
- Classification of a hierarchy of query classes based on theoretical pruning characteristics.
- Presentation of a framework for developing cubegrade applications.
Conclusions
- OLAP and association rules are really one approach
- Key problem: the set of rules or cubegrades can be orders of magnitude larger than the source data set
- Hence, the key issue is how we present and use the obtained rules in applications which provide real value for the user
- Discovery as querying