1
Mining Rank-Correlated Sets Of Numerical Attributes
2
Index
  • Introduction
  • Need for finding new measures for mining
    numerical datasets
  • Problems with previous discretization-based
    association rule methods
  • Discretization example and its disadvantages
  • Proposed methods and concepts
  • Rank and rank correlation measures
  • What are rank and rank correlation?
  • Introducing new rank correlation measures
  • Observations
  • Defining the support for the new measures
  • Example dataset
  • Calculating the rank correlation measure values
    on the example dataset
  • Properties of the new rank correlation measures
  • Categorical attributes combined with numerical
    attributes
  • Explaining the new measures on categorical and
    mixed data
  • Defining confidence and association rules with
    the new support measures
  • Example and explanation
  • Finding ranks with attribute value ties in the
    dataset

3
Introduction
  • The motivation for the research reported upon in
    this paper is an application where we want to
    mine frequently occurring patterns in a
    meteorological dataset containing measurements
    from various weather stations in Belgium over the
    past few years. Each record contains a set of
    measurements (such as temperature or pressure)
    taken in a given station at a given time point,
    together with extra information about the
    stations (such as location or altitude).
  • The classical association rule framework [1],
    however, is not adequate to deal with numerical
    data directly. Most previous approaches to
    association rule mining for numerical attributes
    were based on discretization.
  • For mining rank-correlated sets of numerical
    attributes, three new support measures are
    proposed. Their application to a real-world
    meteorological dataset and the corresponding
    performance evaluations are also presented.

4
Discretization and Concept hierarchy
  • Discretization
  • reduce the number of values for a given
    continuous attribute by dividing the range of the
    attribute into intervals. Interval labels can
    then be used to replace actual data values
  • Concept hierarchies
  • reduce the data by collecting and replacing low
    level concepts (such as numeric values for the
    attribute age) by higher level concepts (such as
    young, middle-aged, or senior)
  • E.g. if you have an attribute Age with values
    (22,23,24,35,36,37,55,56,57), applying
    discretization and breaking this into intervals
    we get (see the sketch below):
  • Age: (Young(20..35), MiddleAge(35..55),
    Old(55..80))
  • Here,
  • Young gives you (22,23,24)
  • MiddleAge gives (35,36,37)
  • Old gives (55,56,57)
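
A minimal sketch of this binning in Python; the interval
boundaries and labels follow the slide, and discretize is just an
illustrative helper name:

    ages = [22, 23, 24, 35, 36, 37, 55, 56, 57]

    def discretize(age):
        # Replace a raw value by the label of its interval.
        if age < 35:
            return "Young"      # 20..35
        elif age < 55:
            return "MiddleAge"  # 35..55
        else:
            return "Old"        # 55..80

    print([discretize(a) for a in ages])
    # ['Young', 'Young', 'Young', 'MiddleAge', 'MiddleAge',
    #  'MiddleAge', 'Old', 'Old', 'Old']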

5
Discretization Example
Table: People

RecordID   Age   Married   NumCars
100        23    No        1
200        25    Yes       1
300        29    No        0
400        34    Yes       2
500        38    Yes       2

Minimum support = 40%, minimum confidence = 50%

Rule                                               Support   Confidence
(Age: 30..39) and (Married: Yes) => (NumCars: 2)   40%       100%
(NumCars: 0..1) => (Married: No)                   40%       66.6%
6
  • The above simple approach (discretization) to
    a quantitative database has two major drawbacks:
  • MinSup:
        If the number of intervals for a
    quantitative attribute is large, the support for
    any single interval can be low. Hence, without
    using larger intervals, some rules involving this
    attribute may not be found because they lack
    minimum support.
  • MinConf:
        Some information is lost whenever we
    partition values into intervals. Some rules may
    have minimum confidence only when an item in the
    antecedent consists of a single value (or a small
    interval). This information loss increases as the
    interval sizes become larger.
  • For example, in the People table above, the rule
    (NumCars: 0) => (Married: No) has 100% confidence
    and 20% support.
  • But if we had partitioned the attribute NumCars
    into intervals such that 0 and 1 cars end up in
    the same partition, then the closest rule is
    (NumCars: 0..1) => (Married: No), which has only
    66.6% confidence and 40% support.

7
Disadvantages of Discretization
  • The disadvantages of discretization can be
    listed as follows:
  • First of all, it always incurs an information
    loss, since values falling in the same bucket
    become indistinguishable and small differences in
    attribute values become unnoticeable.
  • On the other hand, very small changes in values
    close to a discretization border may cause
    unjustifiably large changes in the set of active
    rules.
  • If there are too many discretization intervals,
    discovered rules are replicated in each interval,
    making the overall trends hard to spot.
  • It is also possible that rules will fail to meet
    the minimum support criterion when their support
    is split among many narrow intervals.

8
What are Rank and Rank Correlation Measures?
Rank: Let D be the database and H be its header,
i.e. its set of attributes. The rank of a value
t.A of a numerical attribute A, denoted r(t.A),
is the index of that value when the values of A
are sorted in ascending order. That is, the
smallest value of A gets rank 1, etc. Consider
the following table (a code sketch of this
ranking follows below):

Sample Table: Person

Person   Height   Weight
A        5.5      53
B        5.7      55
C        5.8      45
D        5.9      50
E        6.0      58
F        6.1      65
G        6.2      68
H        6.3      60

The same table represented with rank values:

Person           A  B  C  D  E  F  G  H
Rank by height   1  2  3  4  5  6  7  8
Rank by weight   3  4  1  2  5  7  8  6
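
A minimal sketch of this ranking in Python, using the Person table
above (ranks is an illustrative helper, not the paper's code):

    heights = {"A": 5.5, "B": 5.7, "C": 5.8, "D": 5.9,
               "E": 6.0, "F": 6.1, "G": 6.2, "H": 6.3}
    weights = {"A": 53, "B": 55, "C": 45, "D": 50,
               "E": 58, "F": 65, "G": 68, "H": 60}

    def ranks(values):
        # 1-based position of each key's value in ascending order.
        order = sorted(values, key=values.get)
        return {key: i + 1 for i, key in enumerate(order)}

    print(ranks(heights))  # {'A': 1, 'B': 2, ..., 'H': 8}
    print(ranks(weights))  # {'C': 1, 'D': 2, 'A': 3, 'B': 4, ...}
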
9
Rank Correlation
  • Rank correlation is the study of relationships
    between different rankings of the same set of
    items. It deals with measuring the correspondence
    between two rankings, and assessing the
    significance of this correspondence. Calculating
    the rank correlation coefficients is described in
    the next section.
  • Assumption: For most of this paper it is assumed
    that all values of a numerical attribute are
    distinct.

10
Rank correlation measures
Kendall's τ, Spearman's Footrule F, and Spearman's ρ.
1. Observations for ρ and F:
        When two attributes are highly positively
correlated, the ranks of their values will be
close within most tuples. Hence the sums
Σ_{t∈D} (r(t.A) − r(t.B))² and
Σ_{t∈D} |r(t.A) − r(t.B)| will be small if A and
B are correlated. On the other hand, if A and B
are uncorrelated, the expected absolute
difference between the ranks of A and B is
(|D| − 1)/2, resulting in very high values for
these sums. The formulas are chosen in such a way
that they are 1 for total positive correlation
(e.g. A = B) and −1 for total negative
correlation (e.g. A and B are in reverse order).
2. Observations for τ:
        The formula gives the probability that,
for a randomly chosen pair of tuples (s, t),
s.A < t.A and s.B < t.B. In this case too the
formula is chosen such that it is 1 for total
positive correlation and −1 for total negative
correlation. Due to this connection to a
probability, this formula is more natural than
the other two.
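
For reference, the standard textbook forms of the three
coefficients, consistent with the descriptions above; these exact
normalizations are an assumption, since the slide gives the
formulas only by name:

    \rho(A,B) = 1 - \frac{6 \sum_{t \in D} (r(t.A) - r(t.B))^2}{|D|\,(|D|^2 - 1)}

    F(A,B) = 1 - \frac{3 \sum_{t \in D} |r(t.A) - r(t.B)|}{|D|^2 - 1}

    \tau(A,B) = \frac{4\,|\{(s,t) : s.A < t.A \wedge s.B < t.B\}|}{|D|\,(|D| - 1)} - 1

With these normalizations, ρ and τ equal 1 on identical rankings
and −1 on reversed rankings, matching the observations above.
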
11
Measuring the support for Numerical attributes
  • Using the above coefficients, the following
    support measures are proposed: supp_ρ(I),
    supp_F(I), and supp_τ(I), where I is a set of
    numerical attributes.
  • supp_ρ and supp_F aggregate, per tuple, the
    difference between the maximum and minimum rank
    values over the attributes of I.
  • For supp_τ, all pairs of tuples (s, t) are
    considered and the number of pairs such that s
    comes before t in all attributes of I is counted.
    This number is then divided by the maximal number
    of such possible pairs:
    supp_τ(I) = |{(s,t) : s.A < t.A for all A ∈ I}| / (|D| choose 2)
12
Example Weather Dataset
Consider the following example Weather dataset
where W = wind speed, P = pressure,
T = temperature, and H = humidity. The numbers in
brackets indicate the rank of each value within
its attribute.
  • In order to count supp_τ({T, P, H}), we need to
    count the number of pairs of tuples such that the
    first tuple is smaller than the second tuple in
    T, P, and H. E.g. the pair (t5, t1) is such a
    pair, and (t2, t3) is not. As there are 6 pairs
    of tuples that are in order ((t5, t3), (t5, t2),
    (t5, t1), and three further pairs involving t4),
    supp_τ is 6/10.
  • supp_τ is expensive to compute; a brute-force
    sketch follows below.
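
A minimal brute-force sketch of supp_τ in Python, run on the
Person table from slide 8 for concreteness (supp_tau is an
illustrative helper, not the paper's implementation):

    from itertools import combinations

    # Tuples of (Height, Weight) from the Person table.
    D = [(5.5, 53), (5.7, 55), (5.8, 45), (5.9, 50),
         (6.0, 58), (6.1, 65), (6.2, 68), (6.3, 60)]

    def supp_tau(D, attrs):
        # Count unordered pairs where one tuple is smaller than the
        # other in every attribute, divided by |D| choose 2.
        hits = sum(1 for s, t in combinations(D, 2)
                   if all(s[a] < t[a] for a in attrs)
                   or all(t[a] < s[a] for a in attrs))
        return hits / (len(D) * (len(D) - 1) / 2)

    print(supp_tau(D, [0, 1]))  # 0.7857...: 22 of 28 pairs ordered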

13
Properties
1. Average supp_τ:
Let A1, A2, ..., Ak be k statistically
independent numerical attributes. Then the
expected (average) value of supp_τ({A1,...,Ak})
is 1/2^(k−1): for any pair of tuples, each of its
two orderings is concordant in all k attributes
with probability (1/2)^k.
2. The support given by Kendall's formula is
never greater than the support calculated by
Spearman's Footrule: supp_τ(I) ≤ supp_F(I).
3. The property that the support of a superset
cannot be greater than that of its subset holds
for these support measures too. For I ⊆ J it can
be stated as
0 ≤ supp_m(J) ≤ supp_m(I) ≤ 1.
Property 2 can be used to calculate an upper
bound for supp_τ, because in contrast to supp_τ
itself, supp_F is easy to compute.
14
Categorical attributes
  • The above definitions of support are also
    applicable to ordinal attributes, as their
    domains are ordered.
  • Integrating categorical attributes with
    numerical attributes:
  • Categorical attributes are used to define the
    context in which the support of the numerical
    attributes is computed.

Example: We have attributes related to weather
data. Then a sample rule, "In measurements of
Antwerp and Brussels, the temperature and wind
speed in Antwerp are higher than in Brussels in
70% of the cases", can be expressed with the
support measures above.
15
Association rules
  • As we have defined the support for categorical
    attributes, we can easily define the confidence
    of a rule.

The interpretation of such an association rule
with confidence c is: for all pairs of tuples
(t1, t2) with t1 satisfying J1 and t2 satisfying
J2, if t1 is smaller than t2 in I1, then there is
a chance with confidence c that t1 is also
smaller than t2 in I2.
Example: Consider the rule, "In Antwerp, if the
wind speed in one measurement is greater than in
another measurement, then there is a probability
of 65% that the cloudiness will be higher as
well." A sketch of this computation follows
below.
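
A minimal sketch of this confidence in Python, on hypothetical
Antwerp measurements (both the records and the helper confidence
are illustrative, not the paper's data or code):

    from itertools import permutations

    def confidence(pairs, I1, I2):
        # pairs: ordered pairs (t1, t2) already restricted to the
        # categorical contexts J1 and J2.
        body = head = 0
        for t1, t2 in pairs:
            if all(t1[a] < t2[a] for a in I1):      # t1 smaller in I1
                body += 1
                if all(t1[a] < t2[a] for a in I2):  # ... and in I2
                    head += 1
        return head / body if body else 0.0

    antwerp = [{"wind": 3, "cloud": 2},
               {"wind": 5, "cloud": 4},
               {"wind": 4, "cloud": 5}]
    print(confidence(permutations(antwerp, 2),
                     ["wind"], ["cloud"]))  # 0.666...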

16
Measuring Ranks with ties
  • At the start of this presentation we made the
    assumption that all numerical values of an
    attribute are distinct. That won't be the case in
    real-world data.
  • This happens with ordinal data too, e.g. an
    attribute Cloudiness that can take the values
    (low, medium, high). For this attribute at least
    one third of the pairs will be tied.
  • Modifying the support measures:
  • supp_τ:
        For supp_τ we can just keep the original
    definition; that is, if two tuples have a tie on
    an attribute of I, they won't contribute to
    supp_τ(I).
  • supp_ρ and supp_F:
        Suppose a value t.A is repeated m times in
    attribute A, and there are p tuples with an
    A-value smaller than t.A. Then the rank of the
    value t.A can be taken as any of the following
    (a sketch of option 2 follows below):
  • 1. Any value in the range p + 1, ..., p + m.
  • 2. The average rank: r̄(t.A) = p + (m + 1)/2.
  • 3. The maximal rank: r(t.A) = p + m.
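
A minimal sketch of option 2 (average ranks) in Python;
average_ranks is an illustrative helper name:

    from collections import Counter

    def average_ranks(values):
        # Each of the m copies of a value gets rank p + (m + 1)/2,
        # where p is the number of strictly smaller values.
        counts = Counter(values)
        smaller, p = {}, 0
        for v in sorted(counts):
            smaller[v] = p
            p += counts[v]
        return [smaller[v] + (counts[v] + 1) / 2 for v in values]

    print(average_ranks([3, 1, 4, 1, 5]))  # [3.0, 1.5, 4.0, 1.5, 5.0]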

17
For rankings with ties: supp_ρ(I) and supp_F(I)
  • supp_ρ and supp_F are redefined using the ranks
    with ties described above.
  • The definition of supp_τ does not change.
  • For ranks with no ties these formulas reduce to
    the original ones.

18
Algorithms
  • 1. As only ranks are compared instead of the
    actual values in the database, we first transform
    the database by replacing every attribute value
    by its rank w.r.t. its attribute, which can be
    done in O(|H| |D| log |D|) time.
  • 2. Generate all sets of attributes satisfying
    the minimum support threshold.
  • 3. In the case of supp_ρ or supp_F, this can be
    done by the standard level-wise search, in which
    candidate itemsets are generated only when all of
    their subsets are known to be frequent. Counting
    the supports requires only a single scan through
    the data in every iteration. (A skeleton of this
    search follows below.)
  • 4. In the case that supp_τ is used as the
    support measure, the mining task becomes
    inherently more difficult, since the number of
    pairs of transactions is quadratic in the size of
    the database.
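
A minimal skeleton of the level-wise search in Python, with the
support measure left as a parameter (levelwise and supp are
illustrative names; this is a sketch, not the paper's code):

    from itertools import combinations

    def levelwise(attributes, supp, minsup):
        # Level 1: frequent single attributes.
        frequent = {frozenset([a]) for a in attributes
                    if supp(frozenset([a])) >= minsup}
        result, k = set(frequent), 1
        while frequent:
            # Join frequent k-sets into (k+1)-set candidates ...
            candidates = {x | y for x, y in combinations(frequent, 2)
                          if len(x | y) == k + 1}
            # ... keeping only those whose k-subsets are all frequent.
            candidates = {c for c in candidates
                          if all(c - {a} in frequent for a in c)}
            # One scan over the data per level, inside supp.
            frequent = {c for c in candidates if supp(c) >= minsup}
            result |= frequent
            k += 1
        return result

Any support measure satisfying property 3 of slide 13, such as
supp_F, can be plugged in as supp; that anti-monotonicity is what
makes the pruning sound.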

19
Explicit creation of pairs of transactions
  • A brute-force solution to the presented mining
    problem is to explicitly combine all pairs of
    transactions from the original database D and
    create a Boolean database D', such that a pair
    (s, t) contains an attribute A exactly when
    s.A < t.A (see the sketch below).
  • With this new database we can apply the general
    methods for finding frequent itemsets, so the
    problem is reduced to standard frequent itemset
    mining on the new database D'.
  • Disadvantages:
  • 1. The size of the new database D': it is
    quadratic in the size of the original database.
  • 2. The time needed to create it.
  • 3. Hence it can be applied only to small
    databases.
  • The advantage: it is most of the time very
    efficient in combination with depth-first
    frequent itemset mining algorithms, because all
    pairs that do not support an itemset are removed
    from its branch. Such a fine level of pruning is
    not possible in the other approaches.
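
A minimal sketch of the D' construction in Python (pairs_database
is an illustrative name; the membership condition s.A < t.A is the
one described above):

    from itertools import permutations

    def pairs_database(D, attributes):
        # One Boolean transaction per ordered pair (s, t): it
        # contains attribute A exactly when s.A < t.A.
        return [{A for A in attributes if s[A] < t[A]}
                for s, t in permutations(D, 2)]

    D = [{"W": 3, "T": 20}, {"W": 5, "T": 25}, {"W": 4, "T": 18}]
    print(pairs_database(D, ["W", "T"]))

Note that only one orientation of each pair can contain a whole
itemset I, so supp_τ(I) is twice the relative support of I in D'.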

20
Using Spatial Indices
Spatial indices:
These are data structures especially designed to
allow searching in multidimensional data, in
several dimensions simultaneously. When the
database is too large and the brute-force method
is not applicable, we need an efficient mechanism
to count supp_τ for all candidate itemsets. For a
given attribute set I, this comes down to the
nested for loop sketched below.
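
A minimal reconstruction in Python of the quadratic count just
described (supp_tau_nested is an illustrative name, not the
paper's code):

    def supp_tau_nested(D, I):
        count = 0
        for t in D:                # outer loop over all tuples
            for s in D:            # inner loop: a full scan per tuple
                if all(s[A] < t[A] for A in I):
                    count += 1     # s is smaller than t in all of I
        return count / (len(D) * (len(D) - 1) / 2)

A spatial index (kd-tree, range tree) is meant to replace the
inner loop by a single multidimensional dominance-count query per
tuple.
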
  • Disadvantage:
  • 1. The double loop is not feasible, as it
    performs a quadratic number of operations for
    each candidate set. If we can replace the inner
    for loop by a search technique that finds the
    tuples smaller in all attributes of I, this
    becomes an efficient technique.

21
Optimization using the support property of supp_τ
and supp_F
  • Recall the property of the newly introduced
    support measures supp_τ and supp_F:
    supp_τ(I) ≤ supp_F(I).
  • As supp_F can be computed a lot more efficiently
    than supp_τ, we first compute it for every
    candidate itemset. If it gives a value under the
    minimum support threshold, we can already prune
    the itemset from the search space without
    computing the more expensive supp_τ.
  • After generating the frequent itemsets, the next
    task is to prune the itemsets that do not satisfy
    the support constraints.
  • Generating all candidate itemsets in reverse
    depth-first order gives the best result, because
    it guarantees that all subsets of a given
    candidate itemset are generated before the
    candidate itself.
  • E.g. the subsets of {a, b, c, d} will be
    generated in the following order (a sketch of
    this enumeration follows below):
  • {d}, {c}, {c,d}, {b}, {b,d}, {b,c}, {b,c,d},
    {a}, {a,d}, {a,c}, {a,c,d}, {a,b}, {a,b,d},
    {a,b,c}, {a,b,c,d}
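
A minimal sketch of this reverse depth-first enumeration in Python
(reverse_dfs is an illustrative name):

    def reverse_dfs(items, prefix=()):
        # Extend the current prefix with each remaining item, last
        # item first; every subset is emitted after all its subsets.
        for i in range(len(items) - 1, -1, -1):
            subset = prefix + (items[i],)
            yield subset
            yield from reverse_dfs(items[i + 1:], subset)

    print(list(reverse_dfs(("a", "b", "c", "d"))))
    # ('d',), ('c',), ('c','d'), ('b',), ('b','d'), ('b','c'),
    # ('b','c','d'), ('a',), ('a','d'), ('a','c'), ('a','c','d'), ...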

22
Experimental Evaluation
  • Rules found in the weather data:
  • In the table: the altitude of the sun
    (solar alt), the amount of precipitation
    (precip), the amount of cloudiness (cloud), and
    the wind speed (w speed).
  • Association rules found in weather data of one
    station in Brussels.
  • The first rule simply claims that in 66% of the
    cases, the temperature is higher if the altitude
    of the sun is higher. The second rule indicates
    that if it is raining harder, then there is a
    probability of 64% that the cloudiness is higher
    as well. Intuitively, one would expect these
    confidences to be higher.
  • The last three rules indicate that higher
    cloudiness or higher wind speed positively
    influences the amount of rain, and apparently,
    with both higher cloudiness and wind speed this
    effect becomes even stronger. Indeed, for two
    randomly selected tuples, the probability of
    having a higher precipitation is only 32%.
  • When, however, we know that the cloudiness in
    the first tuple is higher, or the wind speed in
    the first tuple is higher, or both, the
    probability of the first tuple having a higher
    precipitation increases to 48%, 44%, and 57%,
    respectively.

23
Performance evaluation of supp_τ
  • Data and resources used:
  • 1. A real-life meteorological dataset and a
    number of benchmark datasets.
  • 2. All algorithms were implemented in the C
    programming language.
  • 3. Tested on a 2.2 GHz Opteron system with 2 GB
    RAM.
  • 4. The evaluation focuses on supp_τ, as it is
    the most demanding of the three measures.
  • Observations:
  • 1. The explicit-pairs implementation ran out of
    memory at 1,000 records.
  • 2. Kd-trees incur almost no memory overhead and
    fit in main memory.
  • 3. Range trees use a lot of memory and scaled to
    hundreds of thousands of records, depending on
    the minsup parameter.

24
Performance of supp_ρ and supp_F
  • Mining patterns in both these cases is much more
    efficient than with supp_τ, but both methods are
    very sensitive to the minimum support level
    chosen. E.g. for supp_ρ the minimum support must
    be very low, in the range of tenths of millionths
    of a percent. This shows that the theoretical
    maximum value is rarely achieved.
  • Conclusions:
  • 1. Three new support measures for continuous
    attributes, based on rank methods, are
    introduced.
  • 2. The discretization previously required for
    numerical data is no longer needed.
  • 3. Algorithms are presented for mining frequent
    itemsets based on those definitions.
  • 4. Test results on real-world meteorological
    data are presented, and the efficiency of the
    algorithms has been verified on large databases.

25
Questions