Extended Random Sets for Knowledge Discovery in Information Systems - PowerPoint PPT Presentation

1
Extended Random Sets for Knowledge Discovery in
Information Systems
  • Yuefeng Li
  • School of Software Engineering and Data
    Communications
  • Queensland University of Technology
  • Brisbane 4001, Australia
  • Email: y2.li@qut.edu.au

2
Data Mining
  • Data mining, which is also referred to as
    knowledge discovery in databases, is a process of
    nontrivial extraction of implicit, previously
    unknown and potentially useful information
    (patterns) from data in databases
  • Typical approaches
  • Data classification
  • Data clustering
  • Mining association rules

3
Association Rules
  • The objective of mining association rules is to
    discover all rules that have support and
    confidence greater than the user-specified
    minimum support and minimum confidence
  • The form of a rule is
  • A1 ∧ A2 ∧ … ∧ Am → B1 ∧ B2 ∧ … ∧ Bn,
  • where Ai and Bj are sets of attribute values
    from the relevant datasets in a database.

4
Association Rules cont.
  • A → B is an interesting rule iff P(B|A) - P(B)
    is greater than a suitable constant.
  • Criteria
  • Frequency of occurrence is a well-accepted
    criterion.
  • The rules should reflect real world phenomena,
    that is, data mining is to find real world
    patterns.
  • It is desirable to use some mathematical models
    to interpret association rules in order to obtain
    useful patterns.

5
Databases to Decision Tables

Table 1. A decision table
6
An Example cont.
  • Attributes: driver, vehicle type, weather,
    road, time, accident
  • Condition attributes: weather, road
  • Decision attributes: time, accident
  • Decision rule, e.g., "if the weather is foggy and
    the road is icy then the accident occurred at
    night" in 140 cases.

7
Formalization Rough Set
  • S = (U, A) -- an information system
  • where U, a database, is a set of records,
  • A is a set of attributes, and
  • there is a function for every attribute a ∈ A such
    that a: U → Va, where Va is the set of all values
    of a. We call Va the domain of a.

8
Formalization Rough Set cont.
  • B-granule
  • Let B be a subset of A. B determines a binary
    relation I(B) on U such that (x, y) ∈ I(B) if and
    only if a(x) = a(y) for every a ∈ B, where a(x)
    denotes the value of attribute a for element x ∈ U.
  • I(B) is an equivalence relation, and it determines
    the family of all equivalence classes of I(B)
  • The partition determined by B is denoted by U/B.
  • The classes in U/B are referred to as B-granules.
  • The class which contains x is called the B-granule
    induced by x, and is denoted by B(x).
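A minimal sketch of this construction in Python: records that agree on every attribute in B land in the same class. The attribute values below are hypothetical; only the record ids are chosen so that the resulting partition matches the U/C quoted later on slide 10.

```python
# B-granules as a dict-based partition: x and y fall in the same class
# iff they agree on every attribute in B. Values here are hypothetical;
# the record ids reproduce the partition quoted on slide 10.
from collections import defaultdict

def b_granules(table, B):
    """Return U/B, the family of equivalence classes of I(B)."""
    classes = defaultdict(set)
    for x, row in table.items():
        classes[tuple(row[a] for a in B)].add(x)  # key = values of x on B
    return list(classes.values())

table = {
    1: {"weather": "foggy", "road": "icy"},
    7: {"weather": "foggy", "road": "icy"},
    2: {"weather": "foggy", "road": "not icy"},
    5: {"weather": "foggy", "road": "not icy"},
    3: {"weather": "misty", "road": "icy"},
    6: {"weather": "misty", "road": "icy"},
    4: {"weather": "sunny", "road": "dry"},
}
print(sorted(sorted(g) for g in b_granules(table, ["weather", "road"])))
# [[1, 7], [2, 5], [3, 6], [4]]
```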

9
Formalization Rough Set cont.
  • (U, C, D) is called a decision table of (U, A)
    iff
  • C ∪ D ⊆ A, where C, the condition attributes, and
    D, the decision attributes, are disjoint subsets
    of A.
  • C(x) and D(x) indicate the condition granule and
    the decision granule induced by x, respectively.
  • L is a language defined using attributes of A; an
    atomic formula is given by a = v, where a ∈ A and
    v ∈ Va.
  • Formulas can also be formed by logical negation,
    conjunction and disjunction.
  • A formula is called a basic formula in this paper
    if it is an atomic formula or is formed only by
    conjunction.

10
Formalization Rough Set cont.
  • In Table 1, if C = {weather, road} and D =
    {time, accident}, then we have
  • U/C = {{1,7}, {2,5}, {3,6}, {4}} = {c1, c2, c3,
    c4}, the set of condition granules
  • U/D = {{1}, {2,3,7}, {4}, {5,6}} = {d1, d2, d3,
    d4}, the set of decision granules
  • (U, C, D) is a decision table of (U, A), where U
    is a database which includes 1000 records.

11
Pawlak's Interpretation
  • Assumption - each fact in the decision table is a
    subset of U in which all elements have the same
    values for all attributes
  • Every class f determines a rule f(C) → f(D).
  • The strength of the decision rule f(C) → f(D) is
    defined as |C(f) ∩ D(f)| / |U|, and
  • the certainty factor of the decision rule is
    defined as |C(f) ∩ D(f)| / |C(f)|.
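These two measures can be computed directly from fact counts. The sketch below uses the seven facts and counts that appear on later slides (|U| = 1000); the function names are mine.

```python
# Pawlak's strength and certainty factor from fact counts
# (the seven facts used on later slides, |U| = 1000).
facts = {  # (condition granule, decision granule) -> analogous cases
    ("c1", "d1"): 80,  ("c1", "d2"): 20,
    ("c2", "d2"): 140, ("c2", "d4"): 20,
    ("c3", "d2"): 40,  ("c3", "d4"): 200,
    ("c4", "d3"): 500,
}
U_size = sum(facts.values())  # 1000

def strength(c, d):
    """|C(f) ∩ D(f)| / |U| for the fact with condition c and decision d."""
    return facts[(c, d)] / U_size

def certainty(c, d):
    """|C(f) ∩ D(f)| / |C(f)|: divide by all cases sharing condition c."""
    c_size = sum(n for (ci, _), n in facts.items() if ci == c)
    return facts[(c, d)] / c_size

print(strength("c2", "d2"))   # 0.14
print(certainty("c2", "d2"))  # 0.875
```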

12
Pawlak's Interpretation cont.
13
Pawlak's Interpretation cont.
Table 2. Strengths and certainty factors of
decision rules
14
Extended Random Sets
  • The relationships between the premises and the
    conclusions of decision rules
  • c1 → {(d1, 80/100), (d2, 20/100)}
  • c2 → {(d2, 140/160), (d4, 20/160)}
  • c3 → {(d2, 40/240), (d4, 200/240)}
  • c4 → {(d3, 500/500)}

15
Extended Random Sets cont.
  • We use a mapping to formalize the relationship

Γ: U/C → 2^(U/D × [0,1])

and

Γ(ci) = {(di1, wi1), (di2, wi2), …, (dik, wik)}

for all ci ∈ U/C.
16
Extended Random Sets cont.
  • Use the frequency in the decision table as the
    support degree of each condition granule. We
    have

N(ci) = Σ_{C(x) = ci} Nx

for every condition granule ci, where Nx is the
number of analogous cases of fact x. By
normalizing, we can get a probability function P
on U/C such that

P(ci) = N(ci) / UN, where UN = Σ_{cj ∈ U/C} N(cj).
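The construction of the pair of mapping and probability function can be sketched from the fact counts used on slide 14 (names such as `Gamma` are my own):

```python
# Build the extended random set (Gamma, P) from fact counts:
# P(ci) is the normalized support of each condition granule, and
# Gamma(ci) pairs each decision granule with its normalized weight.
from collections import defaultdict

facts = [("c1", "d1", 80), ("c1", "d2", 20), ("c2", "d2", 140),
         ("c2", "d4", 20), ("c3", "d2", 40), ("c3", "d4", 200),
         ("c4", "d3", 500)]

counts = defaultdict(dict)  # ci -> {dj: raw count}
N = defaultdict(int)        # ci -> N(ci), total support of ci
for c, d, n in facts:
    counts[c][d] = n
    N[c] += n
UN = sum(N.values())

P = {c: N[c] / UN for c in N}  # probability function on U/C
Gamma = {c: {d: n / N[c] for d, n in ds.items()} for c, ds in counts.items()}

print(P["c2"])            # 0.16
print(Gamma["c2"]["d2"])  # 0.875
```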
17
Extended Random Sets cont.
  • We call the pair (Γ, P) an extended random set.
  • For a given condition granule ci, we assume

Γ(ci) = {(di1, wi1), …, (dik, wik)};

we can obtain the following decision rules
ci → dij, j = 1, …, k.
18
Extended Random Sets cont.
  • We define the strengths of the decision rules as

strength(ci → dij) = P(ci) × wij,

and the corresponding certainty factors are

certainty(ci → dij) = wij.
19
Extended Random Sets cont.
  • A decision rule ci → dij is an interesting rule
    if

wij - pr(dij)

is greater than a suitable constant.
20
Extended Random Sets cont.
  • where

pr(dj) = Σ_{ci ∈ U/C} P(ci) × wij  (with wij = 0 if dj does not occur in Γ(ci)).

We can prove that pr is a probability function on
U/D.
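A quick numeric check of this claim, mixing the conditional weights by P over the condition granules (numbers follow the counts quoted on slide 14):

```python
# pr(d) = sum over ci of P(ci) * weight of d under ci.
# The values of pr sum to 1, so pr is a probability function on U/D.
P = {"c1": 0.10, "c2": 0.16, "c3": 0.24, "c4": 0.50}
Gamma = {
    "c1": {"d1": 80 / 100, "d2": 20 / 100},
    "c2": {"d2": 140 / 160, "d4": 20 / 160},
    "c3": {"d2": 40 / 240, "d4": 200 / 240},
    "c4": {"d3": 500 / 500},
}
pr = {}
for c, ds in Gamma.items():
    for d, w in ds.items():
        pr[d] = pr.get(d, 0.0) + P[c] * w

print(round(pr["d1"], 10))          # 0.08
print(round(sum(pr.values()), 10))  # 1.0
```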
21
Extended Random Sets cont.
  • Example of an extended random set

22
Extended Random Sets cont.
Table 3. Probability function on the set of
decision granules
23
Extended Random Sets cont.
Table 4. Interesting rules
24
Interpretation of Extended Random Sets
  • A very interesting phenomenon from Table 3
  • Only some descriptions on the set of decision
    granules are meaningful for a given information
    system if we use "or" to combine decision
    granules.
  • E.g.,
  • d1 or d2 -- (accident = yes)
  • d1 or d3 -- ?
  • The concept of meaningful
  • A description X on the set of decision granules
    of decision table (U, C, D) is meaningful if
    there is a decision table (U, E, F), such that
    E ⊆ C, and X ∈ U/F.

25
Interpretation of Extended Random Sets cont.
  • The derived random set (Γ', P) from the extended
    random set (Γ, P)

Γ'(ci) = {dj | (dj, wij) ∈ Γ(ci)}

It determines a Dempster-Shafer mass function
m on 2^(U/D) such that

m(X) = Σ_{ci: Γ'(ci) = X} P(ci).
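A sketch of this derivation, assuming each Γ(ci) contributes only its decision granules as a focal element with mass P(ci), and that equal focal sets pool their mass; belief and plausibility then follow the standard Dempster-Shafer definitions:

```python
# Derived random set -> mass function on 2^(U/D); equal focal sets merge,
# pooling their P mass (assumption). Belief and plausibility are the
# standard Dempster-Shafer measures.
P = {"c1": 0.10, "c2": 0.16, "c3": 0.24, "c4": 0.50}
focal = {  # ci -> decision granules appearing in Gamma(ci)
    "c1": frozenset({"d1", "d2"}),
    "c2": frozenset({"d2", "d4"}),
    "c3": frozenset({"d2", "d4"}),  # same focal set as c2
    "c4": frozenset({"d3"}),
}
m = {}
for c, F in focal.items():
    m[F] = m.get(F, 0.0) + P[c]     # mass function

def belief(X):
    """Total mass committed to subsets of X."""
    return sum(v for F, v in m.items() if F <= set(X))

def plausibility(X):
    """Total mass of focal sets intersecting X."""
    return sum(v for F, v in m.items() if F & set(X))

print(round(m[frozenset({"d2", "d4"})], 10))  # 0.4
print(round(belief({"d2", "d4"}), 10))        # 0.4
print(round(plausibility({"d2"}), 10))        # 0.5
```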
26
Interpretation of Extended Random Sets cont.
Table 5. Uncertain measures on the set of
decision granules
27
Algorithm 1 from Pawlak's Method
  • let UN = 0
  • for (i = 1 to n)  // n is the number of classes
  •   UN = UN + Ni
  • for (i = 1 to n)
  •   strength(i) = Ni/UN; CN = Ni
  •   for (j = 1 to n)
  •     if ((j ≠ i) and (fj(C) = fi(C)))
  •       CN = CN + Nj
  •   certainty_factor(i) = Ni/CN

28
Algorithm 2 from Extended Random Sets
  • let UN = 0, U/C = ∅
  • for (i = 1 to n)
  •   UN = UN + Ni
  • for (i = 1 to n)  // create the data structure
  •   if (fi(C) ∈ U/C)
  •     insert (fi(D), Ni) into Γ(fi(C))
  •   else
  •     add fi(C) into U/C, and set Γ(fi(C)) = {(fi(D), Ni)}
  • for (i = 1 to |U/C|)
  •   P(ci) = (1/UN) × Σj sndi,j
  • for (i = 1 to |U/C|)  // normalization
  •   temp = 0
  •   for (j = 1 to |Γ(ci)|)
  •     temp = temp + sndi,j
  •   for (j = 1 to |Γ(ci)|)
  •     sndi,j = sndi,j/temp
  • for (i = 1 to |U/C|)  // calculate rule strengths
  •   for (j = 1 to |Γ(ci)|)
  •     strength(ci → fsti,j) = P(ci) × sndi,j
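A runnable sketch of Algorithm 2 (a Python dict plays the role of the U/C structure; function and variable names are mine): a single pass groups the facts by their condition part, after which P(ci) and the normalized weights yield each rule strength directly, avoiding Algorithm 1's repeated rescans.

```python
# Algorithm 2 sketch: group facts by condition granule in one pass,
# then strength(ci -> dj) = P(ci) * normalized weight of dj under ci.
from collections import defaultdict

def ers_strengths(facts):
    """facts: list of (condition, decision, Ni); returns rule strengths."""
    UN = sum(n for _, _, n in facts)
    gamma = defaultdict(dict)             # condition -> {decision: count}
    for c, d, n in facts:                 # single pass over the n facts
        gamma[c][d] = gamma[c].get(d, 0) + n
    strengths = {}
    for c, ds in gamma.items():
        total = sum(ds.values())          # normalizing constant for c
        for d, n in ds.items():
            strengths[(c, d)] = (total / UN) * (n / total)  # P(c) * snd
    return strengths

facts = [("c1", "d1", 80), ("c1", "d2", 20), ("c2", "d2", 140),
         ("c2", "d4", 20), ("c3", "d2", 40), ("c3", "d4", 200),
         ("c4", "d3", 500)]
strengths = ers_strengths(facts)
print(round(strengths[("c2", "d2")], 10))  # 0.14
print(round(sum(strengths.values()), 10))  # 1.0
```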

29
Algorithm Analysis
  • Algorithm 1
  • its time complexity is O(n²), where n is the
    number of classes in the decision table.
  • Algorithm 2
  • its time complexity is O(n × |U/C|)
  • Since |U/C| ≤ n, Algorithm 2 is better than
    Algorithm 1 in time complexity.

30
Summary
  • The advantages of our approach can be summarized
    as follows
  • It provides a new algorithm to calculate decision
    rules, which is faster than Pawlak's algorithm
  • In addition to the well-accepted frequency
    criterion, the extended random sets can easily
    include other criteria when determining
    association rules
  • The extended random sets can provide more than
    one measure for dealing with uncertainties in
    the association rules. This is a significant
    characteristic distinguishing the approach from
    other methods.