Data mining - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Data mining
  • Decision trees and rule induction

2
Decision Trees
  • Decision trees are used for classification

3
Classification
  • Input: a data record
  • Output: the class to which this record belongs
  • Example:
  • (young, bright, polite, hard working) -> Student KE

4
Decision Tree
  • Every internal node corresponds to a predictor
    field
  • Every arc corresponds to a value of that field
  • Every leaf corresponds to a class, a value of the
    prediction field

5
Idea behind ID3
  • At each level, in each branch, choose the
    predictor field which is not yet on the path to
    the root and is most informative about the
    prediction.
  • Information is measured by entropy.

Idea behind C4.5
  • Be intelligent about missing values, continuous
    values, pruning, rule induction, ...

6
What is entropy?
  • Say we choose a set M = (m1, m2, ..., mn) of
    messages to exchange information.
  • Then there are n different messages, and we need
    at least log n bits to distinguish between them,
    and hence to exchange information.

7
What is entropy? (cont.)
  • The messages are being exchanged with certain
    relative frequencies.
  • For i = 1..n, these relative frequencies
    (p1, p2, ..., pn) can be interpreted as the
    probability pi that an exchanged message is
    message mi.

8
What is entropy? (cont.)
  • The entropy I(P) of a probability distribution
    P measures the information being exchanged as
    follows:
  • I(P) = -(p1 log p1
  • + p2 log p2
  • + ...
  • + pn log pn)
  • where (of course) 0 log 0 = 0.

9
Example: pi = 1/n for each i
  • P = (1/n, 1/n, ..., 1/n).
  • I(P) = -(1/n log 1/n + 1/n log 1/n + ...)
  • = -log 1/n
  • = log n.

10
More examples
  • P = (1) -> entropy(P) = 0.
  • P = (0.5, 0.5) -> entropy(P) = 1.
  • P = (0.67, 0.33) -> entropy(P) = 0.92.
  • P = (1, 0) -> entropy(P) = 0.
  • For a given value of n, the entropy increases as
    the differences between the probabilities
    decrease.
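The values above can be checked with a short Python sketch (the function name entropy is ours, not from the slides; base-2 logarithms throughout):

```python
import math

def entropy(probs):
    """I(P) = -(p1 log p1 + ... + pn log pn), with 0 log 0 = 0."""
    # skipping p == 0 terms implements the 0 log 0 = 0 convention
    return 0.0 - sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))        # 0.0
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([2/3, 1/3]))   # about 0.918, which the slide rounds to 0.92
print(entropy([1.0, 0.0]))   # 0.0
```

Note that the slide's 0.92 comes from the exact probabilities 2/3 and 1/3; the rounded pair (0.67, 0.33) gives roughly 0.915.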

11
ID3
  • Construct the decision tree such that the
    decrease in entropy is maximized.
  • (it is a greedy approach)
  • The intent is that each node increases the
    certainty about the class of the record as much
    as possible.

12
Entropy in Classification
  • Denote by |T| the number of records, and assume
    there are k classes C1, ..., Ck with corresponding
    prediction values c1, ..., ck.
  • Thus, k is in fact the number of different values
    for prediction field C. Now, define the probability
    distribution
  • P = (|C1|/|T|, |C2|/|T|, ..., |Ck|/|T|).

13
Entropy in Classification (cont.)
  • Now consider predictor D with values d1, ..., dl,
    and define
  • pj(D) = (|C1 ∩ Dj|/|Dj|,
  • |C2 ∩ Dj|/|Dj|,
  • ... ,
  • |Ck ∩ Dj|/|Dj|),
  • the relative frequencies under the condition
    that D has value dj.
  • Subsequently compute the entropy I(pj(D)).

14
Entropy in Classification (cont.)
  • Next define
  • Info(D,T) = |D1|/|T| I(p1(D))
  • + |D2|/|T| I(p2(D))
  • + ...
  • + |Dl|/|T| I(pl(D)),
  • and
  • Gain(D,T) = I(T) - Info(D,T).
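A sketch of Info(D,T) and Gain(D,T) in Python; the function names and the record layout (a list of (fields_dict, class_label) pairs) are our assumptions, not from the slides:

```python
from collections import Counter
import math

def class_entropy(labels):
    """I(P) for the class distribution of a list of labels (0 log 0 = 0)."""
    n = len(labels)
    return 0.0 - sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info(records, d):
    """Info(D,T): weighted entropy remaining after splitting T on predictor d."""
    n = len(records)
    groups = {}
    for fields, label in records:
        groups.setdefault(fields[d], []).append(label)   # partition T by d's value
    return sum(len(g) / n * class_entropy(g) for g in groups.values())

def gain(records, d):
    """Gain(D,T) = I(T) - Info(D,T)."""
    return class_entropy([label for _, label in records]) - info(records, d)
```

A predictor that splits the classes perfectly has Info(D,T) = 0, so its gain equals the full entropy I(T).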

15
The ID3 algorithm
  • ID3(T, R, C)
  • Input:
  • training set T,
  • set of predictor fields R,
  • prediction field C.

16
ID3 (cont.)
  • While T is not empty:
  • If R is empty, return a single node with, as
    prediction value, the value ci which maximizes
    |Ci|.
  • If R is not empty, let D in R be the predictor
    which maximizes Gain(D,T).
  • Return a tree with root node D, and a branch for
    every j = 1..l, which leads to the subtree
    ID3(Dj, R\{D}, C).
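The steps above might be sketched recursively in Python as follows (a minimal sketch: id3, classify, and the record layout are our naming; ties in the gain and values unseen in training are not handled):

```python
from collections import Counter
import math

def _entropy(labels):
    n = len(labels)
    return 0.0 - sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def _gain(records, d):
    """Gain(D,T) = I(T) - Info(D,T); records are (fields_dict, label) pairs."""
    n = len(records)
    groups = {}
    for fields, label in records:
        groups.setdefault(fields[d], []).append(label)
    split_info = sum(len(g) / n * _entropy(g) for g in groups.values())
    return _entropy([lab for _, lab in records]) - split_info

def id3(records, predictors):
    """Return a leaf (a class label) or a node (predictor, {value: subtree})."""
    labels = [lab for _, lab in records]
    if len(set(labels)) == 1:                 # pure node: one class left
        return labels[0]
    if not predictors:                        # R empty: take the majority class
        return Counter(labels).most_common(1)[0][0]
    d = max(predictors, key=lambda p: _gain(records, p))   # greedy choice
    rest = [p for p in predictors if p != d]               # R \ {D}
    return (d, {v: id3([r for r in records if r[0][d] == v], rest)
                for v in {fields[d] for fields, _ in records}})

def classify(tree, fields):
    while isinstance(tree, tuple):            # follow branches down to a leaf
        d, branches = tree
        tree = branches[fields[d]]
    return tree
```

The greedy choice of D is exactly the "maximize Gain(D,T)" step; nothing is ever reconsidered once a split is made.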

17
Rule Induction
  • The sequence of branches from the root can be
    viewed as a classification rule.
  • (If the root field has value X, and the next
    field has value Y, and ..., then the record is in
    class Ci.)
  • Induction: fact-based reasoning.

18
Association rules
  • Example:
  • (Saturday, beer, chips) -> (diapers)
  • BIS applications of association rules:
  • - identifying prospects
  • - identifying customers who will cause trouble

19
Definitions
  • Antecedent: set A of records, with certain values
    for a set of predictor fields.
  • Consequence: set C of records, with certain
    values for a set of prediction fields.
  • (Saturday, beer, chips) -> (diapers)

20
Definitions
  • Support: percentage of the records which belong
    to both A and C.
  • Lift: percentage of records of C which also
    belong to A:
  • p(A and C)/p(C).
  • Confidence: percentage of records of A which also
    belong to C:
  • p(A and C)/p(A).
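A sketch of the three measures over a list of transactions (each a set of items); the function names are ours, and lift here follows the slide's definition p(A and C)/p(C), not the more common p(A and C)/(p(A) p(C)):

```python
def support(transactions, items):
    """Fraction of transactions containing every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def lift(transactions, a, c):
    """Slide's definition: p(A and C) / p(C)."""
    return support(transactions, set(a) | set(c)) / support(transactions, c)

def confidence(transactions, a, c):
    """p(A and C) / p(A)."""
    return support(transactions, set(a) | set(c)) / support(transactions, a)
```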

21
Rule Induction
  • Generate all rules with support at least s, and
    confidence at least c.

22
Brute-force algorithm
  • Generate all rules and check whether they satisfy
    the support and confidence requirements.
  • Complexity?

23
Intelligent algorithm
  • Identify the set M of all (maximal) sets V with
    support at least s as follows:
  • - First check all sets with cardinality 1.
  • - Then check all sets V with cardinality 2:
    {1,2} can only qualify if {1} and {2} qualify
    (this condition is necessary but not sufficient).
  • - Et cetera: {1,2,...,n} can only qualify if
    {1,2,...,n-1}, {1,3,...,n}, ..., {2,3,...,n}
    qualify.

24
Intelligent algorithm
  • For every set l in M, check whether there is a
    nonempty subset Q of l such that
  • Q -> l\Q has sufficient confidence.
  • The confidence of this rule equals
    support(l)/support(Q).
  • Notice that if Q -> l\Q doesn't have sufficient
    confidence, the same holds for Q' -> l\Q' for
    every subset Q' of Q.
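Reading the rule as Q -> l\Q for a nonempty proper subset Q of a frequent set l, with confidence support(l)/support(Q), this second phase might be sketched as (the function name is ours; the subset-pruning shortcut noted above is omitted for brevity):

```python
from itertools import combinations

def rules_from(frequent_sets, transactions, min_conf):
    """For each frequent set l and nonempty proper subset Q of l,
    keep the rule Q -> l\\Q when support(l)/support(Q) >= min_conf."""
    n = len(transactions)

    def sup(items):
        return sum(items <= t for t in transactions) / n

    rules = []
    for l in frequent_sets:
        for r in range(1, len(l)):                   # all proper subset sizes
            for q in map(frozenset, combinations(l, r)):
                conf = sup(l) / sup(q)               # confidence of Q -> l\Q
                if conf >= min_conf:
                    rules.append((q, l - q, conf))
    return rules
```

Since l is frequent and Q is a subset of l, sup(q) is never zero, so the division is safe.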

25
Intelligent algorithm
  • Complexity: only sets with sufficient support are
    being generated, and subsequently only the rules
    with sufficient confidence.
  • Complexity?

26
Exercise: soccer
27
Questions
  • Goal: predict nationality (NL)
  • 1. Show that ID3 might start with voorkeursbeen
    (preferred foot).
  • Show that ID3 might continue with haarkleur
    (hair colour).
  • Complete the ID3 tree.
  • Is there a better tree?
  • What is the ratio between the number of nodes in
    both trees?
  • What is the ratio between the number of levels?
  • Can you think of an even worse example?