Title: Mining Rank-Correlated Sets Of Numerical Attributes.
1Mining Rank-Correlated Sets Of Numerical
Attributes.
2Index
- Introduction.
- Need for finding new measures for mining
numerical datasets. - problems with previous association rules with
discretization methods. - Discretization example and its disadvantages.
- Proposed Methods and concepts.
- Rank and Rank correlation measures.
- What is Rank and Rank correlation?
- Introducing New Rank correlation Measures.
- Observations.
- Defining the support for new measures.
- Example Dataset.
- Calculating the Rank correlation measure values
on the example dataset. - Properties of the new Rank correlation measures.
- Categorical Attribute combined with Numerical
attributes. - Explaining the new measures on categorical and
mixed data. - Defining confidence and Association Rules with
new support measures. - Example and explanation.
- Finding Ranks with attribute value ties in the
dataset.
3Introduction
-
- The motivation for the research reported upon in
this paper is an application where we want to
mine frequently occurring patterns in a
meteorological dataset containing measurements
from various weather stations in Belgium over the
past few years. Each record contains a set of
measurements (such as temperature or pressure)
taken in a given station at a given time point,
together with extra information about the
stations (such as location or altitude). - The classical association rule framework 1,
however, is not adequate to deal with numerical
data directly. Most previous approaches to
association rule mining for numerical attributes
were based on discretization. -
- For mining the rank correlated numerical
datasets three new support measures are proposed.
Their application on the real time meteorological
dataset and according performance evaluations are
also presented.
4Discretization and Concept hierarchy
- Discretization
- reduce the number of values for a given
continuous attribute by dividing the range of the
attribute into intervals. Interval labels can
then be used to replace actual data values - Concept hierarchies
- reduce the data by collecting and replacing low
level concepts (such as numeric values for the
attribute age) by higher level concepts (such as
young, middle-aged, or senior) - E.g. If you have a attribute Age with values
(22,23,24,35,36,37,55,56,57).Applying
discretization and breaking this into intervals
we get, - Age (Young(2035), MiddleAge(35.55),
Old(55.80)) - here ,
- Young gives you (12,13,24)
- MiddleAge (35,36,37)
- Old (55,56,57)
5Discretization Example
Table People
RecordID Age Married Status NumCars
100 23 NO 1
200 25 YES 1
300 29 NO 0
400 34 YES 2
500 38 YES 2
Minimum Support 40 ,Minimum Confidence 50
Rules Support Confidence
(Age 30..39) and (married Yes) gt NumCars 2 40 100
(NumCars 0.1) gt (Married No) 40 66.66
6- The above simple approach (discretization) to
the quantitative database has two major
drawbacks - MinSup
- If the number of intervals for a
quantitative attribute is large, the support for
any single interval can be low. Hence without
using larger intervals ,some rules involving this
attribute may not be found because they lack
minimum support. - MinConf
- There is some information lost whenever we
partition values into intervals. Some rules may
have minimum confidence only when an item in the
antecedent consists of a single value (or a small
interval). This information loss increases as the
interval sizes become larger. - For example, in Figure 2, the rule
-
- (NumCars
O) gt (Married No) has 100 confidence and 20
Support. - But if we had partitioned the attribute
NumCars into intervals such that 0 and 1 cars end
up in the same partition, then the closest rule
is -
- (NumCars O.. 1) gt (Married
No), which only has 66.67 confidence and 60
support.
7Disadvantages of Discretization
- Discretization disadvantages can be listed as
- First of all it always incurs an information
loss, since values falling in the same bucket
become indistinguishable and small differences in
attribute value become unnoticeable. - On the other hand, very small changes in values
close to a discretization border may cause
unjustifiably large changes in the set of active
rules. - If there are too many Discretization intervals,
discovered rules are replicated in each interval
making the overall trends hard to spot. - It is also possible that rules will fail to meet
the minimum support criterion when they are split
among many narrow intervals.
8What is Rank and Rank Correlation measures?
Rank Let D be the database
and H be its header i.e. set of attributes. The
rank of a value of a numerical attribute in A
(t.A), is the index of the value of the attribute
in the database when its ordered ascending w.r.t
A. That is the smallest value of t.A in A gets a
rank 1,etc.Consider following table,
Person Height Weight
A 5.5 53
B 5.7 55
C 5.8 45
D 5.9 50
E 6.0 58
F 6.1 65
G 6.2 68
H 6.3 60
Person A B C D E F G H
Rank by height 1 2 3 4 5 6 7 8
Rank by weight 3 4 1 2 5 7 8 6
Same table represented with Rank Values.
Sample Table Person
9Rank Correlation
- Rank correlation is the study of relationships
between different rankings on the same set of
items. It deals with measuring correspondence
between two rankings, and assessing the
significance of this correspondence. Calculating
the Rank Correlation Coefficients is described in
the next section. -
- Assumption For most part of this paper it is
assumed that all values of numerical attributes
are Distinct .
10Rank correlation measures
Kendalls - Spearmans F Spearmans ?
1.Observatios for ? and F
When two attributes are highly
positively correlated, the ranks of their values
will be close within most tuples. Hence the sums
?t e Dr (t.A) r(t.B))2 and ?t e Dr (t.A)
r(t.B)) will be small if A and B are correlated.
On the other hand, if A and B are uncorrelated,
the expected absolute difference between the
ranks of A and B is (D - 1)/2.resulting in very
high values for these sums. The formulas are
chosen in such a way that they are 1 for total
positive correlation (e.g. AB) and -1 for total
negative correlation (e.g. A and B are in reverse
order) 2.Observations for -
The formula actually gives you
the probability that a randomly chosen pair of
tuples (s, t), s.A lt t.A and s.B lt t.B. In this
case too the formula is chosen such that it is 1
for total positive correlation and -1 for total
negative correlation. Due to this connection to
the probability this formula is natural than the
other two.
11Measuring the support for Numerical attributes
- By using the above coefficients following support
measures are proposed.
-
where I is set of
numerical attributes. supp? and suppF
aggregating the difference between the maximum
and minimum rank values. supp-, all pairs of
tuples (s,t) are taken and the number of pairs
such that s comes before t for all attributes.
This number is then divided by the maximal number
of such possible pairs.
12Example Weather Dataset
Consider the following example Weather dataset
Where, W Wind Speed, P
Pressure, T Temperature, H
Humidity. The numbers in bracket indicate the
rank of the value within the attribute.
- In order to count the supp- (T , P , H ), we
need to count the number of pairs of tuples such
that the first tuple is smaller than the second
tuple in T, P and H. e.g. the pair (t5 , t1) is
such a pair, and (t2 , t3) is not. As there are 6
pairs of tuples that are in order (t5 , t3) (t5
, t2) (t5 , t1) and t4 , t3) (t4 , t3) (t4 ,
t3), supp- is 6/10. - supp- is very complicated to compute.
13Properties
1. Average supp-
Let A1,A2,A3,A4.Ak be k statistically
independent numerical attributes. Then the
expected or average value of the supp- is given
by,
2. Support for Kendalls formula is never
greater than the support calculated by Spearmans
Footrule.
The property Support for a super set can not be
greater than its sub set holds true for those
support measures too. This property can be stated
as
0 lt suppm (J) lt suppm (I) lt 1
This property can be used to calculate the upper
bound for supp-, because in contrast to the supp-
itself it is easy to compute.
14Categorical attributes
- The above definitions of support are also
applicable to ordinal attribute as their domains
are ordered. - Integrating categorical attributes with numerical
attributes - Categorical attributes will be used to define the
context in which the support of numerical
attribute is computed.
Example We have attributes related to weather
data. Then a sample rule, In
measurements of Antwerp and Brussels, the
temperature and wind speed in Antwerp are higher
then in Brussels in 70 of the cases Can be
given as,
15Association rules
- As we have defined the support for categorical
attributes we can easily define the confidence
for the rule.
The interpretation of the
above association rule with confidence c is, for
all pairs of tuples (t1 ,t2)with t1 satisfying
J1, and t2 satisfying J2, if t1 is smaller than
t2 in I1,then there is a chance with confidence c
that t1 is also smaller than t2 in I2
Example Consider the rule,
In Antwerp if the wind speed in one measurement
is greater than in another measurement ,then
there is a probability of 65 that
the cloudiness will be higher as well.
16Measuring Ranks with ties
-
- At the start of this presentation we had a
assumption that all numerical values of an
attribute are Distinct. That wont be the case in
the real time data. -
- This happens with the Ordinal data too. e.g
an attribute Cloudiness that can take values
(low, medium, High). For this attribute at least
1/3rd of the pairs will be tied. - Modifying Support Measures
-
- supp-
- For our support measure supp- , we can
just keep our original definition that is, if
two tuples have a tie on an attribute of I, they
wont contribute to suppt (I). - Supp? SuppF
- i.e., suppose a value
t.A is repeated m times in attribute A, and there
are p tuples with an A-value smaller than t.A.
Then, the rank of value t.A can have any of the
following values -
- 1.In the range of p
1, . . . , p m. -
- 2.The average art.A
(p 1 m)/2 - 3.The maximal rank
r.A p m.
17For rankings with ties Supp?(I) SuppF(I)
- The definition of supp- does not change
- For ranks with no ties these formulas get reduced
to the original ones.
18Algorithms
- 1. As only ranks are compared instead of the
actual values in the database, we first transform
the database by replacing every attribute value
by its rank w.r.t. this attribute, which can be
done in O(HD log (D)) time. -
-
- 2. Generate all sets of attributes
satisfying the minimum support threshold, - 3. In the case of supp., or suppF , it can
be done by the standard level-wise search in
which candidate itemsets are generated only when
all of its subsets are known to be frequent.
Counting the supports requires only a single scan
through the data in every iteration. - 4. In the case that supp- is used as support
measure, the mining task becomes inherently more
difficult since the number of pairs of
transactions is quadratic in the size of the
database.
19Explicit creation of pairs of transactions
- A brute force solution to the presented mining
problem, is to explicitly combine all pairs of
transactions from the original database D, and
create a Boolean database D', such that
- With this new database we can
apply the general methods for finding frequent
itemsets. So the problem is reduced to standard
frequent itemset mining on the new database D. - Disadvantages
- 1.Size of the new
database D ,it is quadratic in the size to the
original database. - 2. Time needed to
create it. - 3. So it can be
applied to only small databases. - The advantage it is most of the
time very efficient in combination with
depth-first frequent itemset mining algorithms
because all pairs that do not support an itemset
will be removed from its branch Such fine level
of pruning will not be possible in the other
approaches.
20Using Spatial Indices
Spatial Indices
These are data structures especially designed to
allow for searching in multidimensional data and
to allow searches in several dimensions
simultaneously. When the
database is too large and the brute force method
is not applicable, we need an efficient mechanism
to count supp- for all candidate itemsets. For a
given attribute set I, this comes down to the
following nested for
- Disadvantage
- 1.The double loop is
not feasible as it performs a quadratic operation
for each candidate set.If we can replace this
inner for loop by some other search technique to
find out the smaller values in I this would
become an efficient technique.
21Optimization by using support property of supp-
and suppF
- Recall the property for the newly
found support measures supp- and suppF, - As suppF can be computed a
lot more efficiently compared to suppT , we will
first compute it for every candidate itemset. In
case it gives a value under the minimum support
threshold, we can already prune the itemset from
the search space without computing the more
expensive supp- . - After generating the
frequent itemset our next task is to prune the
itemsets which are not in the support
constraints. - Generating all candidate
itemsets in reverse order using Reverse Depth
First ,gives us the best result because, it is
guaranteed that all subsets of a given candidate
itemset are generated before the candidate
itself. - E.G. the subsets of a ,b , c, d will
be generated in the following manner, -
d
a -
c
a, d -
c, d
a, c -
b
a, c, d -
b, d
a, b
22Experimental Evaluation
- Rules found in the weather data
- In the table, the altitude of the sun
(solar alt), the amount of precipitation
(precip), the amount of cloudiness (cloud), and
the wind speed (w speed). -
-
- Association
rules found in weather data of one station in
Brussels. -
- The first rule simply claims that in 66 of
the cases, the temperature is higher if the
altitude of the sun is higher. The second rule
indicates that if it is raining harder, then
there is probability of 64 that the cloudiness
is bigger as well. Intuitively, one would expect
these confidences would be higher. - The last three rules indicate that higher
cloudiness or higher wind speed positively
influence the amount of rain, and apparently,
with both higher cloudiness and wind speed this
effect becomes even stronger. Indeed for two
randomly selected tuples, the probability of
having a higher precipitation is only 32. - When, however, we know that the cloudiness in the
first tuple is higher, or the wind speed in the
first tuple is higher, or both, the probability
of the first tuple having a higher precipitation
increases to respectively 48, 44, and 57.
23Performance evaluation of supp-
- Data and Resources Used
-
-
1.Real life meteorological dataset and a number
of benchmark dataset. -
-
2.All algorithms implemented in C programming
language. -
-
3.Tested on 2.2 GHz Opteron system with 2GB RAM. -
-
4.supp- is more applied than the other two
measures. - Observations
-
1.Implementation of Explicit pairs ran out of
memory for 1000 records. -
-
2.Kd-trees incur almost no memory overhead and
fits in main memory. -
3.Range trees uses a lot of memory and scaled
hundreds of thousands of
-
records depending on t he minsup parameter.
24Performance of suppp and suppF
- Mining patterns in both these
cases is much more efficient than in supp-. But
both these methods are very sensitive to minimum
support level chosen. e.g. for suppp the value of
minimum support must be very low, in the range of
tenth of millionth of percent. It shows that
theoretical maximum value is rarely achieved. - Conclusions
- 1.Three new
support measures for continuous attributes based
on rank methods - are
introduced. -
- 2.
Discretization required previously for numerical
data is removed. -
- 3.
Algorithms are presented for mining frequent
itemsets based on those -
definitions. - 4. Tests
results performed on real world meteorological
data are presented and - the
efficiency of the algorithms have been verified
on large databases.
25Questions