Title: Machine Learning
1. Machine Learning
2. Choosing the Best Attribute: Information Gain
- Information gain (from an attribute test) is the difference between the original information requirement and the new requirement:
  Gain(A) = H(p/(p+n), n/(p+n)) - Remainder(A)
- H = entropy
  - Highest when the set is equally divided between positive (p) and negative (n) examples, i.e., (.5, .5), giving a value of 1
  - Lower as the set becomes more unbalanced (e.g., (.9, .1))
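These two properties of H can be checked with a small sketch (the function name `entropy` is ours, not from the slides):

```python
import math

def entropy(*probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An even positive/negative split has maximum entropy:
print(entropy(0.5, 0.5))   # 1.0
# An unbalanced split has lower entropy:
print(entropy(0.9, 0.1))   # ~0.469
```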
3. Information based on attributes
p = n = 5, so p + n = 10 and H(1/2, 1/2) = 1 bit
Remainder(A) = sum_i (p_i + n_i)/(p + n) * H(p_i/(p_i + n_i), n_i/(p_i + n_i))
4. Text Classification
- Is text i a finance news article?
  - Positive
  - Negative
5. Example
6. Example: "stock" and "rolling"
stock occurrences:
  >= 10 : docs 1, 8, 9, 10
  5-10  : docs 2, 3, 4, 5, 6, 7
rolling occurrences:
  >= 10 : docs 9, 10
  5-10  : docs 1, 5, 6, 8
  < 5   : docs 2, 3, 4, 7

Gain(stock) = 1 - [4/10 H(1/4, 3/4) + 6/10 H(4/6, 2/6)]
            = 1 - .4(-.25(-2) - .75(-.42)) - .6(-.67(-.58) - .33(-1.6))
            = 1 - .33 - .55 = .12
Gain(rolling) = 1 - [4/10 H(1/2, 1/2) + 4/10 H(1/2, 1/2) + 2/10 H(1/2, 1/2)] = 0
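The two gains can be reproduced from the positive/negative counts alone (a sketch; `H` and `gain` are our helper names, and the counts assume 5 positive and 5 negative documents overall, consistent with H(1/2, 1/2) = 1):

```python
import math

def H(p, n):
    """Entropy (bits) of a set with p positive and n negative examples."""
    total = p + n
    h = 0.0
    for c in (p, n):
        if c:
            q = c / total
            h -= q * math.log2(q)
    return h

def gain(parent_pn, subsets):
    """Information gain: parent entropy minus the weighted subset entropies."""
    P, N = parent_pn
    total = P + N
    remainder = sum((p + n) / total * H(p, n) for p, n in subsets)
    return H(P, N) - remainder

# "stock" splits the 10 docs into a (1+, 3-) group and a (4+, 2-) group;
# "rolling" splits them into three half-positive groups, so it gains nothing.
print(round(gain((5, 5), [(1, 3), (4, 2)]), 2))            # 0.12
print(gain((5, 5), [(2, 2), (2, 2), (1, 1)]))              # 0.0
```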
7. John's Question
- Does the algorithm help us determine the values?
- How do we know to divide at 10?
8. The algorithm as specified so far is designed for binary classification and attributes with discrete values
- Attributes
  - Outlook: sunny, overcast, rain
  - Temperature: hot, mild, cool
  - Humidity: normal, high
  - Wind: weak, strong
- Classification
  - PlayTennis? Yes, No
9. (No Transcript)
10. Computing the gains (E = .940 for the full set, 9/14 yes)
Humidity:
  High   -> E = .985
  Normal -> E = .592
Wind:
  Weak   -> E = .811
  Strong -> E = 1.0
Gain(humidity) = .940 - (7/14)(.985) - (7/14)(.592) = .151
Gain(wind)     = .940 - (8/14)(.811) - (6/14)(1.0)  = .048
Temperature (E = .940): hot, mild, cool
Outlook (E = .940): sunny, overcast, rain
Gain(S, Outlook) = .246, Gain(S, Humidity) = .151, Gain(S, Wind) = .048, Gain(S, Temperature) = .029
Outlook is selected because it has the highest gain.
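These four gains come from the standard 14-example PlayTennis data in Mitchell's Machine Learning; the table itself is not on the slides, so reproducing it here is an assumption:

```python
import math

# The 14 PlayTennis examples (Outlook, Temperature, Humidity, Wind -> PlayTennis),
# assumed from Mitchell's textbook since the slides show only the resulting gains.
DATA = [
    ("sunny","hot","high","weak","no"),          ("sunny","hot","high","strong","no"),
    ("overcast","hot","high","weak","yes"),      ("rain","mild","high","weak","yes"),
    ("rain","cool","normal","weak","yes"),       ("rain","cool","normal","strong","no"),
    ("overcast","cool","normal","strong","yes"), ("sunny","mild","high","weak","no"),
    ("sunny","cool","normal","weak","yes"),      ("rain","mild","normal","weak","yes"),
    ("sunny","mild","normal","strong","yes"),    ("overcast","mild","high","strong","yes"),
    ("overcast","hot","normal","weak","yes"),    ("rain","mild","high","strong","no"),
]
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def gain(data, i):
    """Entropy of the whole set minus the weighted entropy of each subset."""
    labels = [row[-1] for row in data]
    g = entropy(labels)
    for v in set(row[i] for row in data):
        subset = [row[-1] for row in data if row[i] == v]
        g -= len(subset) / len(data) * entropy(subset)
    return g

gains = {name: gain(DATA, i) for i, name in enumerate(ATTRS)}
print(gains)   # Outlook has the highest gain (~.246), so it is selected first
```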
11. (No Transcript)
12. (No Transcript)
13. Extending the algorithm for continuous-valued attributes
- Dynamically define new discrete-valued attributes that partition the continuous attribute into a discrete set of intervals
- For continuous A, create A_c that is true if A < c, false otherwise
- How to select the best value for threshold c?
  - Sort examples by the continuous attribute
  - Identify adjacent examples that differ in target classification
  - Generate a set of candidate thresholds midway between the corresponding values of A
  - Choose the threshold c that maximizes information gain
14. Example: temperature as a continuous value
- Two candidate thresholds:
  - (48 + 60)/2 = 54
  - (80 + 90)/2 = 85
- Information gain is greater for Temperature > 54 than for Temperature > 85
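The threshold search from the previous slide can be sketched as follows; the temperature values and labels are assumed from Mitchell's version of this example:

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose labels differ."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

def split_gain(values, labels, c):
    """Information gain of the boolean test 'value >= c'."""
    left = [l for v, l in zip(values, labels) if v < c]
    right = [l for v, l in zip(values, labels) if v >= c]
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

# Assumed data: Temperature values with their PlayTennis labels.
temps = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, labels))   # [54.0, 85.0]
best = max(candidate_thresholds(temps, labels),
           key=lambda c: split_gain(temps, labels, c))
print(best)                                  # 54.0
```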
15. Other cases
- What if the class is discrete-valued, but not binary?
- What if an attribute has many values (e.g., 1 per instance)?
16. Training vs. Testing
- A learning algorithm is good if it uses its learned hypothesis to make accurate predictions on unseen data
- Collect a large set of examples (with classifications)
- Divide into two disjoint sets: the training set and the test set
- Apply the learning algorithm to the training set, generating hypothesis h
- Measure the percentage of examples in the test set that are correctly classified by h
- Repeat for different sizes of training sets and different randomly selected training sets of each size
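A minimal holdout-evaluation sketch of these steps; `majority_learner` is a hypothetical stand-in for a real learner such as decision tree induction:

```python
import random

def evaluate(examples, learn, train_frac=0.8, seed=0):
    """Holdout evaluation: learn h on one split, measure accuracy on the rest."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train, test = shuffled[:cut], shuffled[cut:]
    h = learn(train)                         # hypothesis from training data only
    correct = sum(h(x) == y for x, y in test)
    return correct / len(test)

# Hypothetical stand-in learner: predicts the majority class of the training set.
def majority_learner(train):
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

data = [(i, i < 70) for i in range(100)]     # toy examples: 70 positive, 30 negative
print(evaluate(data, majority_learner))
```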
17. (No Transcript)
18. Overfitting
- Learning algorithms may use irrelevant attributes to make decisions
  - For news: day published and newspaper
- When else can overfitting occur?
- Solution 1: decision tree pruning
  - Prune away attributes with low information gain
  - Use statistical significance to test whether the gain is meaningful
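The significance test mentioned above can be sketched as a chi-square check, a pre-pruning idea going back to Quinlan's ID3; the function below is our own illustration:

```python
def chi_square(parent_pn, subsets):
    """Chi-square statistic for whether a split's class distribution
    differs significantly from the parent's."""
    P, N = parent_pn
    total = P + N
    stat = 0.0
    for p, n in subsets:
        k = p + n
        ep, en = k * P / total, k * N / total   # expected counts if the split is uninformative
        stat += (p - ep) ** 2 / ep + (n - en) ** 2 / en
    return stat

# With 2 classes and 2 subsets, df = 1; the 0.05 critical value is 3.841.
# Below that, the attribute's apparent gain is plausibly noise, so prune it.
print(chi_square((5, 5), [(2, 2), (3, 3)]))   # 0.0: split mirrors the parent exactly
print(chi_square((5, 5), [(4, 1), (1, 4)]))   # ~3.6, still under 3.841
```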
19. K-fold Cross-Validation
- Solution 2 to reduce overfitting
- Run k experiments
  - Use a different 1/k of the data for testing each time
- Average the results
- Common choices: 5-fold, 10-fold, leave-one-out
20. Cross-Validation
Labeled data (1566 examples), split into 10 folds:
- 9 folds (approx. 1409 examples): train the model
- 1 fold (approx. 157 examples): evaluate
Lather, rinse, repeat (10 times); report the average
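The loop above, sketched generically (the majority-class learner is a hypothetical placeholder, and the toy folds hold 10 examples rather than ~157):

```python
import random

def k_fold_cross_validation(examples, learn, k=10, seed=0):
    """Split data into k folds; train on k-1 folds, test on the held-out
    fold; report the average accuracy over all k runs."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        h = learn(train)
        accs.append(sum(h(x) == y for x, y in test) / len(test))
    return sum(accs) / k

# Hypothetical stand-in learner: always predicts the training majority class.
def majority_learner(train):
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

data = [(i, i < 60) for i in range(100)]   # 60 positive, 40 negative toy examples
print(k_fold_cross_validation(data, majority_learner))   # 0.6
```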
21. Example
22. Division into 3 sets
- Avoid inadvertent peeking
- Parameters that must be learned (e.g., how to split values)
  - Generate different hypotheses for different parameter values on the training data
  - Choose the parameter values that perform best on the validation data
  - Test the final algorithm on the testing data
- Why do we need to do this for selecting the best attributes?
23. Ensemble Learning
- Learn from a collection of hypotheses
- Majority voting
- Enlarges the hypothesis space
24. Boosting
- Uses a weighted training set
  - Each example has an associated weight w_j >= 0
  - Higher-weighted examples have higher importance
- Initially, w_j = 1 for all examples
- Next round: increase the weights of misclassified examples, decrease the other weights
- From the new weighted set, generate hypothesis h2
- Continue until M hypotheses have been generated
- Final ensemble hypothesis: a weighted-majority combination of all M hypotheses
  - Weight each hypothesis according to how well it did on the training data
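A runnable sketch of this loop (labels are +1/-1; the threshold-stump weak learner and the toy data are our illustration, not from the slides):

```python
import math

def adaboost(examples, learn_weak, M):
    """Boosting: maintain example weights, reweight after each round, and
    return the weighted-majority combination of the M hypotheses."""
    n = len(examples)
    w = [1.0 / n] * n                      # uniform initial weights
    ensemble = []                          # (hypothesis, hypothesis weight) pairs
    for _ in range(M):
        h = learn_weak(examples, w)
        err = sum(wi for wi, (x, y) in zip(w, examples) if h(x) != y)
        if not 0 < err < 0.5:              # not a usable weak hypothesis
            break
        alpha = 0.5 * math.log((1 - err) / err)   # hypothesis weight
        # raise the weights of misclassified examples, lower the others
        w = [wi * math.exp(alpha if h(x) != y else -alpha)
             for wi, (x, y) in zip(w, examples)]
        z = sum(w)
        w = [wi / z for wi in w]
        ensemble.append((h, alpha))
    return lambda x: 1 if sum(a * h(x) for h, a in ensemble) >= 0 else -1

def stump_learner(examples, w):
    """Weak learner: the best single threshold test on 1-D inputs."""
    xs = sorted(set(x for x, _ in examples))
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best_err, best_h = None, None
    for c in thresholds:
        for s in (1, -1):
            h = lambda x, c=c, s=s: s if x >= c else -s
            err = sum(wi for wi, (x, y) in zip(w, examples) if h(x) != y)
            if best_err is None or err < best_err:
                best_err, best_h = err, h
    return best_h

# Toy data that no single stump can fit; the boosted ensemble can.
data = [(1, 1), (2, 1), (3, -1), (4, -1), (5, 1), (6, 1)]
H = adaboost(data, stump_learner, M=5)
print([H(x) for x, _ in data])   # matches the labels above
```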
25. AdaBoost
- The input learning algorithm L must be a weak learning algorithm
  - L always returns a hypothesis with weighted error on the training set slightly better than random
- Returns a hypothesis that classifies the training data perfectly for large enough M
- Boosts the accuracy of the original learning algorithm on the training data
26. Tutorial on using C4.5