Title: Machine Learning
1. Machine Learning
2. Choosing the Best Attribute: Information Gain
- Information gain (from an attribute test) is the difference between the original information requirement and the new requirement:
  Gain(A) = H(p/(p+n), n/(p+n)) - Remainder(A)
- H = entropy
  - Highest when the set is equally divided between positive (p) and negative (n) examples, i.e., (.5, .5), giving a value of 1
  - Lower as the set becomes more unbalanced (e.g., (.9, .1))
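These two properties of H can be checked with a small sketch (the function name `entropy` is ours, not from the slides):

```python
import math

def entropy(*probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An even positive/negative split has maximum entropy:
print(entropy(0.5, 0.5))   # 1.0
# An unbalanced split has lower entropy:
print(entropy(0.9, 0.1))   # ~0.469
```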
3. Information based on attributes
p = n = 5, so p + n = 10 and H(1/2, 1/2) = 1 bit
Remainder(A) = sum_i (p_i + n_i)/(p + n) * H(p_i/(p_i + n_i), n_i/(p_i + n_i))
4. Text Classification
- Is text i a finance news article?
  - Positive
  - Negative
5. Example
6. Example: "stock" and "rolling"
stock occurrences:
  >= 10 : docs 1, 8, 9, 10
  5-10  : docs 2, 3, 4, 5, 6, 7
rolling occurrences:
  >= 10 : docs 9, 10
  5-10  : docs 1, 5, 6, 8
  < 5   : docs 2, 3, 4, 7

Gain(stock) = 1 - [4/10 H(1/4, 3/4) + 6/10 H(4/6, 2/6)]
            = 1 - .4(-.25(-2) - .75(-.42)) - .6(-.67(-.58) - .33(-1.6))
            = 1 - .33 - .55 = .12
Gain(rolling) = 1 - [4/10 H(1/2, 1/2) + 4/10 H(1/2, 1/2) + 2/10 H(1/2, 1/2)] = 0
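The two gains can be reproduced from the positive/negative counts alone (a sketch; `H` and `gain` are our helper names, and the counts assume 5 positive and 5 negative documents overall, consistent with H(1/2, 1/2) = 1):

```python
import math

def H(p, n):
    """Entropy (bits) of a set with p positive and n negative examples."""
    total = p + n
    h = 0.0
    for c in (p, n):
        if c:
            q = c / total
            h -= q * math.log2(q)
    return h

def gain(parent_pn, subsets):
    """Information gain: parent entropy minus the weighted subset entropies."""
    P, N = parent_pn
    total = P + N
    remainder = sum((p + n) / total * H(p, n) for p, n in subsets)
    return H(P, N) - remainder

# "stock" splits the 10 docs into a (1+, 3-) group and a (4+, 2-) group;
# "rolling" splits them into three half-positive groups, so it gains nothing.
print(round(gain((5, 5), [(1, 3), (4, 2)]), 2))            # 0.12
print(gain((5, 5), [(2, 2), (2, 2), (1, 1)]))              # 0.0
```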
7. John's Question
- Does the algorithm help us determine the values?
- How do we know to divide at 10?
8. The algorithm as specified so far is designed for binary classification and attributes with discrete values
- Attributes
  - Outlook: sunny, overcast, rain
  - Temperature: hot, mild, cool
  - Humidity: normal, high
  - Wind: weak, strong
- Classification
  - PlayTennis? Yes, No
9. (No Transcript)
10. Computing the gains (E = .940 for the full set, 9/14 yes)
Humidity:
  High   -> E = .985
  Normal -> E = .592
Wind:
  Weak   -> E = .811
  Strong -> E = 1.0
Gain(humidity) = .940 - (7/14)(.985) - (7/14)(.592) = .151
Gain(wind)     = .940 - (8/14)(.811) - (6/14)(1.0)  = .048
Temperature (E = .940): hot, mild, cool
Outlook (E = .940): sunny, overcast, rain
Gain(S, Outlook) = .246, Gain(S, Humidity) = .151, Gain(S, Wind) = .048, Gain(S, Temperature) = .029
Outlook is selected because it has the highest gain.
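These four gains come from the standard 14-example PlayTennis data in Mitchell's Machine Learning; the table itself is not on the slides, so reproducing it here is an assumption:

```python
import math

# The 14 PlayTennis examples (Outlook, Temperature, Humidity, Wind -> PlayTennis),
# assumed from Mitchell's textbook since the slides show only the resulting gains.
DATA = [
    ("sunny","hot","high","weak","no"),          ("sunny","hot","high","strong","no"),
    ("overcast","hot","high","weak","yes"),      ("rain","mild","high","weak","yes"),
    ("rain","cool","normal","weak","yes"),       ("rain","cool","normal","strong","no"),
    ("overcast","cool","normal","strong","yes"), ("sunny","mild","high","weak","no"),
    ("sunny","cool","normal","weak","yes"),      ("rain","mild","normal","weak","yes"),
    ("sunny","mild","normal","strong","yes"),    ("overcast","mild","high","strong","yes"),
    ("overcast","hot","normal","weak","yes"),    ("rain","mild","high","strong","no"),
]
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def gain(data, i):
    """Entropy of the whole set minus the weighted entropy of each subset."""
    labels = [row[-1] for row in data]
    g = entropy(labels)
    for v in set(row[i] for row in data):
        subset = [row[-1] for row in data if row[i] == v]
        g -= len(subset) / len(data) * entropy(subset)
    return g

gains = {name: gain(DATA, i) for i, name in enumerate(ATTRS)}
print(gains)   # Outlook has the highest gain (~.246), so it is selected first
```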
11. (No Transcript)
12. (No Transcript)
13. Extending the algorithm for continuous-valued attributes
- Dynamically define new discrete-valued attributes that partition the continuous attribute into a discrete set of intervals
- For continuous A, create A_c that is true if A < c, false otherwise
- How to select the best value for threshold c?
  - Sort examples by the continuous attribute
  - Identify adjacent examples that differ in target classification
  - Generate a set of candidate thresholds midway between the corresponding values of A
  - Choose the threshold c that maximizes information gain
14. Example: temperature as a continuous value
- Two candidate thresholds:
  - (48 + 60)/2 = 54
  - (80 + 90)/2 = 85
- Information gain is greater for Temperature > 54 than for Temperature > 85
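The threshold search from the previous slide can be sketched as follows; the temperature values and labels are assumed from Mitchell's version of this example:

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose labels differ."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

def split_gain(values, labels, c):
    """Information gain of the boolean test 'value >= c'."""
    left = [l for v, l in zip(values, labels) if v < c]
    right = [l for v, l in zip(values, labels) if v >= c]
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

# Assumed data: Temperature values with their PlayTennis labels.
temps = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, labels))   # [54.0, 85.0]
best = max(candidate_thresholds(temps, labels),
           key=lambda c: split_gain(temps, labels, c))
print(best)                                  # 54.0
```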
15. Other cases
- What if the class is discrete-valued, but not binary?
- What if an attribute has many values (e.g., 1 per instance)?
16. Training vs. Testing
- A learning algorithm is good if it uses its learned hypothesis to make accurate predictions on unseen data
- Collect a large set of examples (with classifications)
- Divide into two disjoint sets: the training set and the test set
- Apply the learning algorithm to the training set, generating hypothesis h
- Measure the percentage of examples in the test set that are correctly classified by h
- Repeat for different sizes of training sets and different randomly selected training sets of each size
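A minimal holdout-evaluation sketch of these steps; `majority_learner` is a hypothetical stand-in for a real learner such as decision tree induction:

```python
import random

def evaluate(examples, learn, train_frac=0.8, seed=0):
    """Holdout evaluation: learn h on one split, measure accuracy on the rest."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train, test = shuffled[:cut], shuffled[cut:]
    h = learn(train)                         # hypothesis from training data only
    correct = sum(h(x) == y for x, y in test)
    return correct / len(test)

# Hypothetical stand-in learner: predicts the majority class of the training set.
def majority_learner(train):
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

data = [(i, i < 70) for i in range(100)]     # toy examples: 70 positive, 30 negative
print(evaluate(data, majority_learner))
```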
17. (No Transcript)
18. Overfitting
- Learning algorithms may use irrelevant attributes to make decisions
  - For news: day published and newspaper
- When else can overfitting occur?
- Solution 1: decision tree pruning
  - Prune away attributes with low information gain
  - Use statistical significance to test whether the gain is meaningful
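The significance test mentioned above can be sketched as a chi-square check, a pre-pruning idea going back to Quinlan's ID3; the function below is our own illustration:

```python
def chi_square(parent_pn, subsets):
    """Chi-square statistic for whether a split's class distribution
    differs significantly from the parent's."""
    P, N = parent_pn
    total = P + N
    stat = 0.0
    for p, n in subsets:
        k = p + n
        ep, en = k * P / total, k * N / total   # expected counts if the split is uninformative
        stat += (p - ep) ** 2 / ep + (n - en) ** 2 / en
    return stat

# With 2 classes and 2 subsets, df = 1; the 0.05 critical value is 3.841.
# Below that, the attribute's apparent gain is plausibly noise, so prune it.
print(chi_square((5, 5), [(2, 2), (3, 3)]))   # 0.0: split mirrors the parent exactly
print(chi_square((5, 5), [(4, 1), (1, 4)]))   # ~3.6, still under 3.841
```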
19. K-fold Cross-Validation
- Solution 2 to reduce overfitting
- Run k experiments
  - Use a different 1/k of the data for testing each time
- Average the results
- Common choices: 5-fold, 10-fold, leave-one-out
20. Cross-Validation
Labeled data (1566 examples), split into 10 folds:
- 9 folds (approx. 1409 examples): train the model
- 1 fold (approx. 157 examples): evaluate
Lather, rinse, repeat (10 times); report the average
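The loop above, sketched generically (the majority-class learner is a hypothetical placeholder, and the toy folds hold 10 examples rather than ~157):

```python
import random

def k_fold_cross_validation(examples, learn, k=10, seed=0):
    """Split data into k folds; train on k-1 folds, test on the held-out
    fold; report the average accuracy over all k runs."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        h = learn(train)
        accs.append(sum(h(x) == y for x, y in test) / len(test))
    return sum(accs) / k

# Hypothetical stand-in learner: always predicts the training majority class.
def majority_learner(train):
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

data = [(i, i < 60) for i in range(100)]   # 60 positive, 40 negative toy examples
print(k_fold_cross_validation(data, majority_learner))   # 0.6
```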
21. Example
22. Division into 3 sets
- Avoid inadvertent peeking
- Parameters that must be learned (e.g., how to split values)
  - Generate different hypotheses for different parameter values on the training data
  - Choose the parameter values that perform best on the validation data
  - Test the final algorithm on the testing data
- Why do we need to do this for selecting the best attributes?
23. Ensemble Learning
- Learn from a collection of hypotheses
- Majority voting
- Enlarges the hypothesis space
24. Boosting
- Uses a weighted training set
  - Each example has an associated weight w_j >= 0
  - Higher-weighted examples have higher importance
- Initially, w_j = 1 for all examples
- Next round: increase the weights of misclassified examples, decrease the other weights
- From the new weighted set, generate hypothesis h2
- Continue until M hypotheses have been generated
- Final ensemble hypothesis: a weighted-majority combination of all M hypotheses
  - Weight each hypothesis according to how well it did on the training data
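A runnable sketch of this loop (labels are +1/-1; the threshold-stump weak learner and the toy data are our illustration, not from the slides):

```python
import math

def adaboost(examples, learn_weak, M):
    """Boosting: maintain example weights, reweight after each round, and
    return the weighted-majority combination of the M hypotheses."""
    n = len(examples)
    w = [1.0 / n] * n                      # uniform initial weights
    ensemble = []                          # (hypothesis, hypothesis weight) pairs
    for _ in range(M):
        h = learn_weak(examples, w)
        err = sum(wi for wi, (x, y) in zip(w, examples) if h(x) != y)
        if not 0 < err < 0.5:              # not a usable weak hypothesis
            break
        alpha = 0.5 * math.log((1 - err) / err)   # hypothesis weight
        # raise the weights of misclassified examples, lower the others
        w = [wi * math.exp(alpha if h(x) != y else -alpha)
             for wi, (x, y) in zip(w, examples)]
        z = sum(w)
        w = [wi / z for wi in w]
        ensemble.append((h, alpha))
    return lambda x: 1 if sum(a * h(x) for h, a in ensemble) >= 0 else -1

def stump_learner(examples, w):
    """Weak learner: the best single threshold test on 1-D inputs."""
    xs = sorted(set(x for x, _ in examples))
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best_err, best_h = None, None
    for c in thresholds:
        for s in (1, -1):
            h = lambda x, c=c, s=s: s if x >= c else -s
            err = sum(wi for wi, (x, y) in zip(w, examples) if h(x) != y)
            if best_err is None or err < best_err:
                best_err, best_h = err, h
    return best_h

# Toy data that no single stump can fit; the boosted ensemble can.
data = [(1, 1), (2, 1), (3, -1), (4, -1), (5, 1), (6, 1)]
H = adaboost(data, stump_learner, M=5)
print([H(x) for x, _ in data])   # matches the labels above
```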
25. AdaBoost
- The input learning algorithm L must be a weak learning algorithm
  - L always returns a hypothesis with weighted error on the training set slightly better than random
- Returns a hypothesis that classifies the training data perfectly for large enough M
- Boosts the accuracy of the original learning algorithm on the training data
26. Tutorial on using C4.5