Midterm 3 Revision and ID3 - PowerPoint PPT Presentation

1
Midterm 3 Revision and ID3
Lecture 20
  • Prof. Sin-Min Lee

2
Armstrong's Axioms
  • We can find F+ (the closure of F) by applying Armstrong's Axioms:
  • if β ⊆ α, then α → β (reflexivity)
  • if α → β, then γα → γβ (augmentation)
  • if α → β and β → γ, then α → γ (transitivity)
  • These rules are
  • sound (they generate only functional dependencies that actually hold), and
  • complete (they generate all functional dependencies that hold).

3
Additional rules
  • If α → β and α → γ, then α → βγ (union)
  • If α → βγ, then α → β and α → γ (decomposition)
  • If α → β and γβ → δ, then αγ → δ (pseudotransitivity)
  • The above rules can be inferred from Armstrong's axioms.

4
Example
  • R = (A, B, C, G, H, I)
    F = { A → B, A → C, CG → H, CG → I, B → H }
  • Some members of F+:
  • A → H
  • by transitivity from A → B and B → H
  • AG → I
  • by augmenting A → C with G, to get AG → CG,
    and then transitivity with CG → I
  • CG → HI
  • by augmenting CG → I to infer CG → CGI,
  • and augmenting CG → H to infer CGI → HI,
  • and then transitivity

5
2. Closure of an attribute set
  • Given a set of attributes A and a set of FDs F, the closure of A under F
    (written A+) is the set of all attributes implied by A
  • In other words, the largest B such that
  • A → B
  • Redefining superkeys
  • The closure of a superkey is the entire relation schema
  • Redefining candidate keys
  • 1. It is a superkey
  • 2. No proper subset of it is a superkey

6
Computing the closure for A
  • Simple algorithm
  • 1. Start with B = A.
  • 2. Go over all functional dependencies, α → β, in F
  • 3. If α ⊆ B, then
  • add β to B
  • 4. Repeat until B stops changing
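A minimal Python sketch of this loop (the function name attribute_closure and the representation of F as (lhs, rhs) pairs of attribute sets are my own choices, not from the slides):

```python
def attribute_closure(attrs, fds):
    """Closure of a set of attributes under a set of functional dependencies.

    attrs: iterable of attribute names, e.g. {'A', 'G'}
    fds:   iterable of (lhs, rhs) pairs of attribute sets,
           e.g. [({'A'}, {'B'}), ({'C', 'G'}, {'H'})]
    """
    closure = set(attrs)                  # step 1: start with B = A
    changed = True
    while changed:                        # step 4: repeat until B stops changing
        changed = False
        for lhs, rhs in fds:              # step 2: scan every FD alpha -> beta
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)       # step 3: if alpha ⊆ B, add beta to B
                changed = True
    return closure
```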

7
Example
  • R = (A, B, C, G, H, I)
    F = { A → B, A → C, CG → H, CG → I, B → H }
  • (AG)+ = ?
  • 1. result = AG
  • 2. result = ABCG (A → C and A → B)
  • 3. result = ABCGH (CG → H and CG ⊆ ABCG)
  • 4. result = ABCGHI (CG → I and CG ⊆ ABCGH)
  • Is AG a candidate key?
  • 1. It is a superkey: (AG)+ = ABCGHI = R.
  • 2. No proper subset is: (A)+ = ABCH, (G)+ = G.
  • YES.
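The closure sketch from the previous slide reproduces these results (the variable F below is just this dependency set written in that sketch's format):

```python
F = [({'A'}, {'B'}), ({'A'}, {'C'}), ({'C', 'G'}, {'H'}),
     ({'C', 'G'}, {'I'}), ({'B'}, {'H'})]

print(attribute_closure({'A', 'G'}, F))  # {'A','B','C','G','H','I'} -> AG is a superkey
print(attribute_closure({'A'}, F))       # {'A','B','C','H'}         -> A alone is not
print(attribute_closure({'G'}, F))       # {'G'}                     -> G alone is not
```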

8
Uses of attribute set closures
  • Determining superkeys and candidate keys
  • Determining if A → B is a valid FD
  • Check whether the closure A+ contains B
  • Can be used to compute F+

9
3. Extraneous Attributes
  • Consider F, and a functional dependency A → B.
  • Extraneous: are there any attributes in A or B that can be safely removed?
  • Without changing the constraints implied by F
  • Example: given F = { A → C, AB → CD },
  • C is extraneous in AB → CD since AB → C can be inferred even after
    deleting C

10
4. Canonical Cover
  • A canonical cover for F is a set of dependencies Fc such that
  • F logically implies all dependencies in Fc, and
  • Fc logically implies all dependencies in F, and
  • No functional dependency in Fc contains an extraneous attribute, and
  • Each left side of a functional dependency in Fc is unique
  • In some (vague) sense, it is a minimal version of F
  • Read up on algorithms to compute Fc

11
Loss-less Decompositions
  • Definition: A decomposition of R into (R1, R2) is called lossless if, for
    all legal instances of r(R):
  • r = ΠR1(r) ⋈ ΠR2(r)
  • In other words, projecting on R1 and R2, and joining back, results in the
    relation you started with
  • Rule: A decomposition of R into (R1, R2) is lossless iff
  • R1 ∩ R2 → R1 or R1 ∩ R2 → R2
  • is in F+.
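Using the attribute_closure sketch from the closure slides, this rule can be checked mechanically; the function name is mine, and the trailing example is the R(A, B, C) decomposition discussed on the next slide:

```python
def is_lossless_decomposition(r1, r2, fds):
    """Binary decomposition (R1, R2) is lossless iff
    R1 ∩ R2 → R1 or R1 ∩ R2 → R2 holds in F+."""
    common = set(r1) & set(r2)
    closure = attribute_closure(common, fds)
    return set(r1) <= closure or set(r2) <= closure

# R(A, B, C) with A → B, B → C, decomposed into (A, B) and (B, C)
F = [({'A'}, {'B'}), ({'B'}, {'C'})]
print(is_lossless_decomposition({'A', 'B'}, {'B', 'C'}, F))  # True: B → BC
```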

12
Dependency-preserving Decompositions
  • Is it easy to check if the dependencies in F hold?
  • Okay as long as each dependency can be checked within a single table.
  • Consider R = (A, B, C), and F = { A → B, B → C }
  • 1. Decompose into R1 = (A, B) and R2 = (A, C)
  • Lossless? Yes.
  • But it makes it hard to check B → C
  • The data is in multiple tables.
  • 2. On the other hand, R1 = (A, B) and R2 = (B, C)
  • is both lossless and dependency-preserving
  • Really? What about A → C?
  • If we can check A → B and B → C, then A → C is implied.

13
Dependency-preserving Decompositions
  • Definition
  • Consider a decomposition of R into R1, …, Rn.
  • Let Fi be the set of dependencies in F+ that include only attributes in Ri.
  • The decomposition is dependency preserving if
  • (F1 ∪ F2 ∪ … ∪ Fn)+ = F+

14
Example
  • Suppose we have R(A, B, C) with
  • FD1. A → B
  • FD2. A → C
  • FD3. B → C
  • The decomposition R1(A, B), R2(A, C) is lossless but not dependency
    preserving.
  • The decomposition R1(A, B), R2(A, C) and R3(B, C) is dependency preserving.
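A rough mechanical check of these two claims, reusing attribute_closure and the simplified "FDs of F whose attributes all lie in Ri" projection from the definition above (all function names are mine, and this is only a sketch, not a full test against F+):

```python
F = [({'A'}, {'B'}), ({'A'}, {'C'}), ({'B'}, {'C'})]

def implies(fds, lhs, rhs):
    """Does the FD set `fds` imply lhs → rhs?"""
    return set(rhs) <= attribute_closure(lhs, fds)

def projected(fds, fragment):
    """FDs whose attributes all lie inside `fragment`."""
    return [(l, r) for l, r in fds if set(l) | set(r) <= set(fragment)]

def is_dependency_preserving(fds, fragments):
    union = [fd for frag in fragments for fd in projected(fds, frag)]
    return all(implies(union, l, r) for l, r in fds)

print(is_dependency_preserving(F, [{'A', 'B'}, {'A', 'C'}]))              # False: B → C is lost
print(is_dependency_preserving(F, [{'A', 'B'}, {'A', 'C'}, {'B', 'C'}]))  # True
```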

15
BCNF
  • Given a relation schema R and a set of functional dependencies F, if every
    FD A → B is either
  • 1. trivial, or
  • 2. such that A is a superkey of R,
  • then R is in BCNF (Boyce-Codd Normal Form)
  • Why is BCNF good?

16
BCNF
  • What if the schema is not in BCNF?
  • Decompose (split) the schema into two pieces.
  • Careful: you want the decomposition to be lossless

17
Achieving BCNF Schemas
  • For every dependency A → B in F, check whether A is a superkey
  • By using attribute closure
  • If not, then
  • Choose a dependency in F that breaks the BCNF rules, say A → B
  • Create R1 = A ∪ B
  • Create R2 = A ∪ (R − B − A)
  • Note that R1 ∩ R2 = A and A → AB (= R1), so this is a lossless
    decomposition
  • Repeat for R1 and R2
  • By defining F1 to be all dependencies in F that contain only attributes
    in R1
  • Similarly F2
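A compact Python sketch of this decomposition loop, reusing attribute_closure. Like the slide, it takes Fi to be just the FDs of F whose attributes all lie in Ri rather than the full projection of F+, so treat it as an illustration rather than a complete algorithm:

```python
def bcnf_decompose(R, fds):
    """Decompose relation schema R (a set of attributes) into BCNF fragments."""
    R = set(R)
    for lhs, rhs in fds:
        lhs, rhs = set(lhs), set(rhs)
        trivial = rhs <= lhs
        superkey = R <= attribute_closure(lhs, fds)
        if not trivial and not superkey:            # A → B violates BCNF
            r1 = lhs | rhs                          # R1 = A ∪ B
            r2 = lhs | (R - rhs - lhs)              # R2 = A ∪ (R − B − A)
            f1 = [(l, r) for l, r in fds if set(l) | set(r) <= r1]
            f2 = [(l, r) for l, r in fds if set(l) | set(r) <= r2]
            return bcnf_decompose(r1, f1) + bcnf_decompose(r2, f2)
    return [R]                                      # no violation: R is in BCNF

# Example 1 below: R(A, B, C), F = { A → B, B → C }
print(bcnf_decompose({'A', 'B', 'C'}, [({'A'}, {'B'}), ({'B'}, {'C'})]))
# two fragments, {B, C} and {A, B}, matching Example 1
```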

18
Example 1
  • R = (A, B, C)
  • F = { A → B, B → C }
  • Candidate keys: A
  • BCNF: No. B → C violates it.

  • R1 = (B, C)
  • F1 = { B → C }
  • Candidate keys: B
  • BCNF: true
  • R2 = (A, B)
  • F2 = { A → B }
  • Candidate keys: A
  • BCNF: true

19
Example 2-1
  • R = (A, B, C, D, E)
  • F = { A → B, BC → D }
  • Candidate keys: ACE
  • BCNF: Violated by A → B, BC → D, etc.

  • R1 = (A, B)
  • F1 = { A → B }
  • Candidate keys: A
  • BCNF: true
  • R2 = (A, C, D, E)
  • F2 = { AC → D }
  • Candidate keys: ACE
  • BCNF: false (AC → D)
  • Dependency preservation ???
  • We can check
  • A → B (R1), AC → D (R3),
  • but we lost BC → D
  • So this is not a dependency-preserving decomposition
  • R3 = (A, C, D)
  • F3 = { AC → D }
  • Candidate keys: AC
  • BCNF: true
  • R4 = (A, C, E)
  • F4 = only trivial
  • Candidate keys: ACE
  • BCNF: true

20
Example 2-2
  • R = (A, B, C, D, E)
  • F = { A → B, BC → D }
  • Candidate keys: ACE
  • BCNF: Violated by A → B, BC → D, etc.

  • R1 = (B, C, D)
  • F1 = { BC → D }
  • Candidate keys: BC
  • BCNF: true
  • R2 = (B, C, A, E)
  • F2 = { A → B }
  • Candidate keys: ACE
  • BCNF: false (A → B)
  • Dependency preservation ???
  • We can check
  • BC → D (R1), A → B (R3),
  • so this is a dependency-preserving decomposition
  • R3 = (A, B)
  • F3 = { A → B }
  • Candidate keys: A
  • BCNF: true
  • R4 = (A, C, E)
  • F4 = only trivial
  • Candidate keys: ACE
  • BCNF: true

21
Example 3
  • R = (A, B, C, D, E, H)
  • F = { A → BC, E → HA }
  • Candidate keys: DE
  • BCNF: Violated by A → BC, etc.

  • R1 = (A, B, C)
  • F1 = { A → BC }
  • Candidate keys: A
  • BCNF: true
  • R2 = (A, D, E, H)
  • F2 = { E → HA }
  • Candidate keys: DE
  • BCNF: false (E → HA)
  • Dependency preservation ???
  • We can check
  • A → BC (R1), E → HA (R3),
  • so this is a dependency-preserving decomposition
  • R3 = (E, H, A)
  • F3 = { E → HA }
  • Candidate keys: E
  • BCNF: true
  • R4 = (D, E)
  • F4 = only trivial
  • Candidate keys: DE
  • BCNF: true

22
Classification vs. Prediction
  • Classification
  • Once a classifier is built, it predicts categorical class labels of new
    data, i.e. it classifies unknown data. We also say that it predicts the
    class labels of the new data.
  • Construction of the classifier (a model) is based on a training set, in
    which the values of a decision attribute (class labels) are given, and it
    is tested on a test set.
  • Prediction
  • A statistical method that models continuous-valued functions, i.e.,
    predicts unknown or missing values

23
Classification Process: Model Construction
Classification Algorithms
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
24
Testing and Prediction (by a classifier)
(Jeff, Professor, 4)
Tenured?
25
(No Transcript)
26
J. Ross Quinlan originally developed ID3 at the University of Sydney. He
first presented ID3 in 1975; the algorithm was later described in the journal
Machine Learning, vol. 1, no. 1. ID3 is based on the Concept Learning System
(CLS) algorithm. The basic CLS algorithm over a set of training instances C:
Step 1: If all instances in C are positive, then create a YES node and halt.
If all instances in C are negative, create a NO node and halt. Otherwise
select a feature F with values v1, ..., vn and create a decision node.
Step 2: Partition the training instances in C into subsets C1, C2, ..., Cn
according to the values of F.
Step 3: Apply the algorithm recursively to each of the sets Ci.
Note: the trainer (the expert) decides which feature to select.
27
ID3 improves on CLS by adding a feature-selection heuristic. ID3 searches
through the attributes of the training instances and extracts the attribute
that best separates the given examples. If the attribute perfectly classifies
the training sets then ID3 stops; otherwise it recursively operates on the n
(where n = number of possible values of an attribute) partitioned subsets to
get their "best" attribute. The algorithm uses a greedy search, that is, it
picks the best attribute and never looks back to reconsider earlier choices.
28
A decision tree is a tree in which each branch
node represents a choice between a number of
alternatives, and each leaf node represents a
classification or decision.
29
(No Transcript)
30
(No Transcript)
31
(No Transcript)
32
  • Choosing Attributes and ID3
  • The order in which attributes are chosen determines how complicated the
    tree is.
  • ID3 uses information theory to determine the most informative attribute.
  • A measure of the information content of a message is the inverse of the
    probability of receiving the message:
  • information1(M) = 1/probability(M)
  • Taking logs (base 2) makes information correspond to the number of bits
    required to encode a message:
  • information(M) = -log2(probability(M))

33
  • Information
  • The information content of a message should be
    related to the degree of surprise in receiving
    the message.
  • Messages with a high probability of arrival are
    not as informative as messages with low
    probability.
  • Learning aims to predict accurately, i.e. reduce
    surprise.
  • Probabilities are multiplied to get the
    probability of two or more things both/all
    happening. Taking logarithms of the probabilities
    allows information to be added instead of
    multiplied.

34
A measure from Information Theory used in the ID3 algorithm, and in many
others used in decision tree construction, is Entropy. Informally, the
entropy of a dataset can be considered to be how disordered it is. Entropy is
related to information in the sense that the higher the entropy, or
uncertainty, of some data, the more information is required in order to
completely describe that data. In building a decision tree, we aim to
decrease the entropy of the dataset until we reach leaf nodes, at which point
the subset that we are left with is pure, or has zero entropy, and represents
instances all of one class (all instances have the same value for the target
attribute).
35
We measure the entropy of a dataset S, with respect to one attribute, in this
case the target attribute, with the following calculation:

  Entropy(S) = -Σi pi log2(pi)

where pi is the proportion of instances in the dataset that take the ith
value of the target attribute. This probability measure gives us an
indication of how uncertain we are about the data. We use a log2 measure
because this represents how many bits we would need in order to specify the
class (the value of the target attribute) of a random instance.
36
Using the example of the marketing data, we know that there are two classes
in the data, so we use the fraction that each class represents in an entropy
calculation:

  Entropy(S) = Entropy(9/14 responses, 5/14 no responses)
             = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits
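A quick numerical check of this figure in Python (nothing here beyond the 9/14 and 5/14 counts from the slide; the helper is reused by later examples):

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([9, 5]))   # ≈ 0.940 bits for 9 responses vs 5 no-responses
```
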
37
  • Entropy
  • Different messages have different probabilities of arrival.
  • The overall level of uncertainty (termed entropy) is
  • -Σ P log2 P
  • Frequency can be used as a probability estimate.
  • E.g. if there are 5 +ve examples and 3 -ve examples in a node, the
    estimated probability of +ve is 5/8 = 0.625.

38
Example
  • The initial decision tree is one node with all examples.
  • There are 4 +ve examples and 3 -ve examples
  • i.e. probability of +ve is 4/7 = 0.57;
    probability of -ve is 3/7 = 0.43
  • Entropy is -(0.57 log 0.57) - (0.43 log 0.43) = 0.99

39
  • Evaluate possible ways of splitting.
  • Try a split on size, which has three values: large, medium and small.
  • There are four instances with size = large.
  • There are two large positive examples and two large negative examples.
  • The probability of +ve is 0.5
  • The entropy is -(0.5 log 0.5) - (0.5 log 0.5) = 1

40
  • There is one small +ve and one small -ve
  • Entropy is -(0.5 log 0.5) - (0.5 log 0.5) = 1
  • There is only one medium +ve and no medium -ves, so entropy is 0.
  • The expected information for a split on size is
    (4/7)×1 + (2/7)×1 + (1/7)×0 ≈ 0.86
  • The expected information gain is 0.99 - 0.86 = 0.13
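These two numbers can be checked with the entropy helper sketched a few slides back; the per-value (+ve, -ve) counts are the ones just listed:

```python
# class counts (+ve, -ve) for each value of size
size_split = {'large': [2, 2], 'medium': [1, 0], 'small': [1, 1]}
total = 7

expected_info = sum(sum(c) / total * entropy(c) for c in size_split.values())
gain = entropy([4, 3]) - expected_info
print(round(expected_info, 2), round(gain, 2))   # 0.86 0.13
```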

41
  • Now try splitting on colour and shape.
  • Colour has an information gain of 0.52
  • Shape has an information gain of 0.7
  • Therefore split on shape.
  • Repeat for all subtrees.

42
(No Transcript)
43
(No Transcript)
44
Decision Tree for PlayTennis
  Outlook = Sunny:    Humidity = High → No;   Humidity = Normal → Yes
  Outlook = Overcast: Yes
  Outlook = Rain:     Wind = Strong → No;     Wind = Weak → Yes
45
Decision Tree for PlayTennis
  Outlook = Sunny:    Humidity = High → No;   Humidity = Normal → Yes
  (the Overcast and Rain branches are not shown on this slide)
46
Decision Tree for PlayTennis
Outlook Temperature Humidity Wind PlayTennis
Sunny Hot High Weak ?
47
Decision Tree for Conjunction
Outlook = Sunny ∧ Wind = Weak
  Outlook = Sunny:    Wind = Strong → No;  Wind = Weak → Yes
  Outlook = Overcast: No
  Outlook = Rain:     No
48
Decision Tree for Disjunction
Outlook = Sunny ∨ Wind = Weak
  Outlook = Sunny:    Yes
  Outlook = Overcast: Wind = Strong → No;  Wind = Weak → Yes
  Outlook = Rain:     Wind = Strong → No;  Wind = Weak → Yes
49
Decision Tree for XOR
Outlook = Sunny XOR Wind = Weak
  Outlook = Sunny:    Wind = Strong → Yes;  Wind = Weak → No
  Outlook = Overcast: Wind = Strong → No;   Wind = Weak → Yes
  Outlook = Rain:     Wind = Strong → No;   Wind = Weak → Yes
50
Decision Tree
  • decision trees represent disjunctions of
    conjunctions

(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)
51
(No Transcript)
52
(No Transcript)
53
(No Transcript)
54
(No Transcript)
55
(No Transcript)
56
(No Transcript)
57
(No Transcript)
58
(No Transcript)
59
(No Transcript)
60
Information Gain Computation (ID3/C4.5) Case of
Two Classes
61
(No Transcript)
62
ID3 algorithm
  • To get the fastest decision-making procedure, one has to arrange the
    attributes in a decision tree in a proper order: the most discriminating
    attributes first. This is done by the algorithm called ID3.
  • The most discriminating attribute can be defined in precise terms as the
    attribute for which fixing its value reduces the entropy of the possible
    decisions the most. Let wj be the frequency of the j-th decision in a set
    of examples x. Then the entropy of the set is
  • E(x) = -Σj wj log(wj)
  • Let fix(x,a,v) denote the set of those elements of x whose value of
    attribute a is v. The average entropy that remains in x after the value
    of a has been fixed is
  • H(x,a) = Σv kv E(fix(x,a,v)),
  • where kv is the fraction of examples in x with attribute a having value v.
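A small sketch of H(x, a) in Python, reusing the entropy helper from the earlier slides; representing the example set as a list of dicts is my own choice:

```python
def class_counts(rows, target):
    """Counts of each decision (value of `target`) in a list of example dicts."""
    values = [r[target] for r in rows]
    return [values.count(v) for v in set(values)]

def remaining_entropy(rows, attr, target):
    """H(x, a): average entropy left after fixing attribute `attr`."""
    total = len(rows)
    h = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == v]   # fix(x, a, v)
        h += len(subset) / total * entropy(class_counts(subset, target))
    return h
```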

63
(No Transcript)
64
(No Transcript)
65
OK, now we want a quantitative way of seeing the effect of splitting the
dataset on a particular attribute (which is part of the tree-building
process). We can use a measure called Information Gain, which calculates the
reduction in entropy (gain in information) that would result from splitting
the data on an attribute A:

  Gain(S, A) = Entropy(S) - Σv (|Sv| / |S|) Entropy(Sv)

where v ranges over the values of A, Sv is the subset of instances of S where
A takes the value v, and |S| is the number of instances.
66
Continuing with our example dataset (let's name it S just for convenience),
let's work out the Information Gain that splitting on the attribute District
would result in over the entire dataset.
  • By calculating this value for each attribute that remains, we can see
    which attribute splits the data most purely. Let's imagine we want to
    select an attribute for the root node; performing the above calculation
    for all attributes gives:
  • Gain(S, House Type) = 0.049 bits
  • Gain(S, Income) = 0.151 bits
  • Gain(S, Previous Customer) = 0.048 bits
67
We can clearly see that District results in the greatest reduction in
entropy, i.e. the highest information gain. We would therefore choose it at
the root node, splitting the data up into subsets corresponding to all the
different values of the District attribute. With this node-evaluation
technique we can proceed recursively through the subsets we create until leaf
nodes have been reached throughout and all subsets are pure with zero
entropy. This is exactly how ID3 and other variants work.
68
Example 1: If S is a collection of 14 examples with 9 YES and 5 NO examples,
then
  Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Notice entropy is 0 if all members of S belong to the same class (the data is
perfectly classified). The range of entropy is 0 ("perfectly classified") to
1 ("totally random").
Gain(S, A), the information gain of example set S on attribute A, is defined
as
  Gain(S, A) = Entropy(S) - Σv ((|Sv| / |S|) Entropy(Sv))
where
  v ranges over all possible values of attribute A,
  Sv = the subset of S for which attribute A has value v,
  |Sv| = the number of elements in Sv,
  |S| = the number of elements in S.
69
Example 2: Suppose S is a set of 14 examples in which one of the attributes
is wind speed. The values of Wind can be Weak or Strong. The classification
of these 14 examples is 9 YES and 5 NO. For attribute Wind, suppose there are
8 occurrences of Wind = Weak and 6 occurrences of Wind = Strong. For
Wind = Weak, 6 of the examples are YES and 2 are NO. For Wind = Strong, 3 are
YES and 3 are NO. Therefore

  Entropy(Sweak)   = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.811
  Entropy(Sstrong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.00
  Gain(S, Wind) = Entropy(S) - (8/14) Entropy(Sweak) - (6/14) Entropy(Sstrong)
                = 0.940 - (8/14)(0.811) - (6/14)(1.00)
                = 0.048

For each attribute, the gain is calculated and the highest gain is used in
the decision node.
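The same numbers fall out of the entropy helper defined earlier (only the counts from this example are used):

```python
S_counts = [9, 5]                               # 9 YES, 5 NO overall
wind = {'Weak': [6, 2], 'Strong': [3, 3]}       # per-value (YES, NO) counts

gain_wind = entropy(S_counts) - sum(
    sum(c) / sum(S_counts) * entropy(c) for c in wind.values())
print(round(gain_wind, 3))                      # 0.048
```
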
70
(No Transcript)
71
(No Transcript)
72
(No Transcript)
73
(No Transcript)
74
(No Transcript)
75
  • Decision Tree Construction Algorithm (pseudo-code)
  • Input: a data set S.  Output: a decision tree.
  • 1. If all the instances have the same value for the target attribute,
    then return a decision tree that is simply this value (not really a
    tree - more of a stump).
  • 2. Else:
  • Compute Gain values (see above) for all attributes, select an attribute
    with the highest value, and create a node for that attribute.
  • Make a branch from this node for every value of the attribute.
  • Assign all possible values of the attribute to branches.
  • Follow each branch by partitioning the dataset down to only the instances
    where the branch's value is present, and then go back to step 1.
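A compact Python sketch of this construction, reusing entropy, class_counts and remaining_entropy from the earlier sketches; the dictionary tree representation, the majority-vote fallback when no attributes remain, and the toy data are my own additions:

```python
def id3(rows, attributes, target):
    """rows: list of dicts; attributes: list of attribute names; target: class attribute."""
    classes = set(r[target] for r in rows)
    if len(classes) == 1:                        # pure subset: return a leaf ("stump")
        return classes.pop()
    if not attributes:                           # nothing left to split on: majority class
        values = [r[target] for r in rows]
        return max(set(values), key=values.count)
    def gain(a):                                 # information gain of splitting on a
        return entropy(class_counts(rows, target)) - remaining_entropy(rows, a, target)
    best = max(attributes, key=gain)             # attribute with the highest gain
    tree = {best: {}}
    for v in set(r[best] for r in rows):         # one branch per value of best
        subset = [r for r in rows if r[best] == v]
        rest = [a for a in attributes if a != best]
        tree[best][v] = id3(subset, rest, target)
    return tree

# toy usage
data = [{'Wind': 'Weak', 'Play': 'Yes'}, {'Wind': 'Strong', 'Play': 'No'},
        {'Wind': 'Weak', 'Play': 'Yes'}, {'Wind': 'Strong', 'Play': 'No'}]
print(id3(data, ['Wind'], 'Play'))   # e.g. {'Wind': {'Weak': 'Yes', 'Strong': 'No'}}
```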

76
(No Transcript)
77
(No Transcript)
78
See the following articles for details on the ID3 algorithm:
(1) http://www.dcs.napier.ac.uk/peter/vldb/dm/node11.html
(2) http://www.cse.unsw.edu.au/billw/cs9414/notes/ml/06prop/id3/id3.html