Title: Learning
1Learning
- Shyh-Kang Jeng
- Department of Electrical Engineering/
- Graduate Institute of Communication Engineering
- National Taiwan University
2References
- J. P. Bigus and J. Bigus, Constructing Intelligent Agents with Java, Wiley Computer Publishing, 1998
- S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Englewood Cliffs, NJ: Prentice Hall, 1995
3Learning Agents
[Figure: learning agent architecture - performance standard, feedback, learning goals, and changes to knowledge inside the agent, connected to the environment through sensors and effectors]
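The architecture above can be read as a simple control loop. Below is a minimal Python sketch (not from the slides); the critic, learning-element, and performance-element objects are hypothetical placeholders for the boxes in the figure.

class LearningAgent:
    # Minimal sketch of the learning-agent loop, with hypothetical collaborators
    def __init__(self, performance_element, learning_element, critic):
        self.performance_element = performance_element  # uses knowledge to choose actions
        self.learning_element = learning_element        # turns feedback into knowledge changes
        self.critic = critic                            # compares outcomes to the performance standard

    def step(self, percept):
        action = self.performance_element.act(percept)     # sensors -> effectors
        feedback = self.critic.evaluate(percept, action)   # performance standard -> feedback
        changes = self.learning_element.learn(feedback)    # feedback -> changes
        self.performance_element.update(changes)           # changes -> knowledge
        return action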
4Forms of Learning
- Rote learning
- Parameter or weight adjustment
- Induction
- Clustering, chunking, or abstraction of knowledge
- Data mining as knowledge discovery
5Learning Paradigms
- Supervised learning
  - Programming by examples
  - Most common
  - Historical data is often used as the training data
- Unsupervised learning
  - Performs a type of feature detection
- Reinforcement learning
  - Error information is less specific
6Neural Networks
- Borrowing heavily from the metaphor of the human brain
- Can be used in supervised, unsupervised, and reinforcement learning scenarios
- Can be used for classification, clustering, and prediction
- Most are implemented as programs running on serial computers
7Nerve Cell
[Figure: nerve cell anatomy - soma, nucleus, dendrites, axon, axonal arborization, and a synapse with the axon from another cell]
8Neuron
[Figure: processing unit j with inputs x1, x2, x3, ..., xn, weights w1j, w2j, w3j, ..., wnj, and output yj = f(sumj), where sumj is the weighted sum of the inputs]
9Logistic (Sigmoid) Activation Function
[Figure: logistic activation function, plotting activation against sum]
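A minimal Python sketch (not from the slides) of the unit in the figure: the weighted sum is passed through the logistic activation, giving an output between 0 and 1.

import math

def logistic(s):
    # Logistic (sigmoid) activation: maps the weighted sum into (0, 1)
    return 1.0 / (1.0 + math.exp(-s))

def neuron_output(x, w):
    # y_j = f(sum_j), where sum_j = sum over i of w_ij * x_i
    s = sum(xi * wi for xi, wi in zip(x, w))
    return logistic(s)

# Example with three inputs: the weighted sum is 0.5, so the output is about 0.62
print(neuron_output([1.0, 0.5, -1.0], [0.4, 0.6, 0.2]))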
10Back Propagation Network
11Generic Neural Network Learning Algorithm
- Assign weights randomly
- repeat
  - for each e in examples
    - o <- Output(e)
    - t <- observed output values from e
    - update the weights based on e, o, t
- until all examples are correctly predicted or a stopping criterion is reached
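The slide leaves Output and the weight update abstract. One common instantiation, sketched below in Python (an assumption, not the slides' own code), is a single sigmoid unit trained with the delta rule.

import math, random

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

def train(examples, n_inputs, alpha=0.5, max_epochs=1000):
    # Generic loop from the slide: assign weights randomly, then repeat over the examples
    w = [random.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]   # last weight is the bias
    for _ in range(max_epochs):
        all_correct = True
        for x, t in examples:                       # for each e in examples
            xb = x + [1.0]                          # constant input for the bias weight
            o = logistic(sum(wi * xi for wi, xi in zip(w, xb)))   # o <- Output(e)
            err = t - o                             # t is the observed output value
            for i in range(len(w)):                 # delta-rule update based on e, o, t
                w[i] += alpha * err * o * (1 - o) * xb[i]
            if round(o) != t:
                all_correct = False
        if all_correct:                             # until all examples correctly predicted
            break
    return w

# Example: a linearly separable target (logical OR)
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
print(train(data, n_inputs=2))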
12Changes to the Weight
13Kohonen Map
14Changes to the Weight
15Models of Chord Classification
16Self-Organized Map for Musical Chord
Identification
17References for Decision Tree Learning
- T. Mitchell, Machine Learning, McGraw-Hill, 1997
- J. R. Quinlan, C4.5 Programs for Machine
Learning, Morgan Kaufmann Publishers, 1993
18Play Tennis
19Entropy and Information Gain
20Information Gain Computation
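For a set S with a fraction p+ of positive and p- of negative examples, the standard definitions from Mitchell (1997), which match the computations on the next two slides, are:

\mathrm{Entropy}(S) = -p_{+}\log_{2}p_{+} - p_{-}\log_{2}p_{-}

\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)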
21Option 1
S: [9+, 5-], E = 0.940
Split on Wind:
- Strong: [3+, 3-], E = 1.00
- Weak: [6+, 2-], E = 0.811
Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.00) = 0.048
22Option 2
S: [9+, 5-], E = 0.940
Split on Humidity:
- High: [3+, 4-], E = 0.985
- Normal: [6+, 1-], E = 0.592
Gain(S, Humidity) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151
23Information Gain for Four Attributes
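The Python sketch below recomputes the gain for all four attributes. It assumes the standard 14-example PlayTennis table from Mitchell (1997), which is consistent (up to rounding) with the [9+, 5-] counts and the gains shown on the previous two slides.

import math
from collections import Counter

# Assumed data: the PlayTennis table from Mitchell (1997)
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(examples):
    counts = Counter(e[-1] for e in examples)
    n = len(examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain(examples, attr):
    i, n = ATTRS[attr], len(examples)
    remainder = 0.0
    for v in set(e[i] for e in examples):
        subset = [e for e in examples if e[i] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder

for a in ATTRS:
    print(a, round(gain(DATA, a), 3))
# Prints approximately: Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
# (the slide's 0.151 comes from rounding the intermediate entropies to three decimals)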
24Partially Learned Tree
{D1, D2, ..., D14}: [9+, 5-]
Split on Outlook:
- Sunny: {D1, D2, D8, D9, D11}, [2+, 3-] -> ?
- Overcast: {D3, D7, D12, D13}, [4+, 0-] -> Yes
- Rain: {D4, D5, D6, D10, D14}, [3+, 2-] -> ?
25Next Step
26Decision Tree
Outlook = Sunny:
  Humidity = High: No
  Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
  Wind = Strong: No
  Wind = Weak: Yes
27Decision Tree Learning Algorithm (1)
- DecisionTreeLearning( examples, attributes, default )
- 1. if examples is empty, return default
- 2. if all examples have the same classification, return the classification
- 3. if attributes is empty, return MajorityValue( examples )
- 4. best <- ChooseAttribute( attributes, examples )
- 5. tree <- a new decision tree with root test best
28Decision Tree Learning Algorithm (2)
- 6. for each value vi of best do
  - examplesi <- elements of examples with best = vi
  - subtree <- DecisionTreeLearning( examplesi, attributes - best, MajorityValue( examples ) )
  - add a branch to tree with label vi and subtree subtree
- 7. return tree
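A direct Python rendering of the pseudocode above (a sketch, not the slides' own implementation). Examples are assumed to be dictionaries with a "class" key plus one key per attribute, and ChooseAttribute is passed in as a function, for instance the information-gain criterion from the earlier slides.

from collections import Counter

def majority_value(examples):
    # Most common classification among the examples
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def decision_tree_learning(examples, attributes, default, choose_attribute):
    if not examples:                                   # 1. no examples left
        return default
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                              # 2. all examples agree
        return classes.pop()
    if not attributes:                                 # 3. no attributes left
        return majority_value(examples)
    best = choose_attribute(attributes, examples)      # 4. pick the best attribute
    tree = {best: {}}                                  # 5. root test on best
    for v in {e[best] for e in examples}:              # 6. one branch per observed value of best
        examples_v = [e for e in examples if e[best] == v]
        subtree = decision_tree_learning(
            examples_v,
            [a for a in attributes if a != best],
            majority_value(examples),
            choose_attribute)
        tree[best][v] = subtree                        # add a branch labelled v
    return tree                                        # 7. return tree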
29Output of JDecisionTree
Type = Action: Yes ( 2.0 / 0.0 )
Type = Romance: Yes ( 4.0 / 0.0 )
Type = Funny: No ( 3.0 / 0.0 )
Type = History:
  Finish = Comedy: Yes ( 3.0 / 0.0 )
  Finish = Tragedy: Yes ( 0.0 / 0.0 )
  Finish = Insipid: Yes ( 0.0 / 0.0 )
  Finish = Unknown: No ( 1.0 / 0.0 )
Type = Horror: Yes ( 0.0 / 0.0 )
Type = Warm: Yes ( 0.0 / 0.0 )
Type = Science_fiction: Yes ( 3.0 / 0.0 )
Type = Art: Yes ( 2.0 / 0.0 )
Type = Story:
  Class = Movie: Yes ( 0.0 / 0.0 )
  Class = Series: No ( 1.0 / 0.0 )
  Class = Soap_opera: Yes ( 0.0 / 0.0 )
  Class = Cartoon: Yes ( 2.0 / 0.0 )
  Class = Animation: Yes ( 0.0 / 0.0 )
30IMPECCABLE
Agent
31IMPECCABLE
32Website of cable TV service provider
33eHome Architecture
34Learning from the User
35eHome Agents 2002
- Lighting control agent
- Air conditioning control agent
- TV program selection agent
36eHome Center
[Figure: eHome Center - user, appliances, server, user profiles, information server, agent place, and agent communities]
37Personalization Agent
38Expressiveness of Decision Trees
- All learning can be seen as learning the representation of a function
- Any Boolean function can be written as a decision tree
- If the function is the parity function, an exponentially large decision tree is needed
- It is also difficult to use a decision tree to represent a majority function
39Ockham's Razor
- The most likely hypothesis is the simplest one that is consistent with all observations
- Extracting a pattern means being able to describe a large number of cases in a concise way
- Rather than just trying to find a decision tree that agrees with the examples, we try to find a concise one, too
40Assessing the Performance of the Learning Algorithm
- Collect a large set of examples
- Divide it into two disjoint sets: the training set and the test set
- Use the learning algorithm with the training set as examples to generate a hypothesis H
- Measure the percentage of examples in the test set that are correctly classified by H
- Repeat the above steps for different sizes of training sets and different randomly selected training sets of each size
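A minimal Python sketch of this holdout procedure (an illustration, not the slides' own code). The learn and classify arguments are placeholders for any learning algorithm, for example the decision-tree learner sketched earlier; calling the function for increasing training-set sizes yields the points of a learning curve.

import random

def assess(examples, learn, classify, train_fraction=0.7, trials=10):
    # Average test-set accuracy over several random training/test splits
    scores = []
    for _ in range(trials):
        shuffled = examples[:]
        random.shuffle(shuffled)
        cut = int(train_fraction * len(shuffled))
        training, test = shuffled[:cut], shuffled[cut:]       # two disjoint sets
        h = learn(training)                                   # hypothesis H from the training set
        correct = sum(1 for x, y in test if classify(h, x) == y)
        scores.append(correct / len(test))                    # fraction correctly classified by H
    return sum(scores) / len(scores)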
41Learning Curve
- Average prediction quality as a function of the size of the training set
[Figure: learning curve]
42Noise and Overfitting
- Two or more examples with the same descriptions but different classifications
- In many cases the learning algorithm can use the irrelevant attributes to make spurious distinctions among the examples
- For decision tree learning, tree pruning is useful to deal with overfitting
43Issues in Applications of Decision-Tree Learning
- Missing data
- Multivalued attributes
- Continuous-valued attributes
44Tree Pruning
- The resulting decision tree is often a very complex tree that overfits the examples by inferring more structure than is justified by the training cases
- The complex tree can actually have a higher error rate than a simpler tree
- Tasks are often at least partly indeterminate because the attributes do not capture all information relevant to classification
45Pessimistic Pruning (1)
physician fee freeze = n:
  adoption of the budget resolution = y: D (151)
  adoption of the budget resolution = u: D (1)
  adoption of the budget resolution = n:
    education spending = n: D (6)
    education spending = y: D (9)
    education spending = u: R (1)
46Pessimistic Pruning (2)
- Subtree:
    education spending = n: D (6)
    education spending = y: D (9)
    education spending = u: R (1)
- Estimated errors for the subtree:
    6 × U25(0,6) + 9 × U25(0,9) + 1 × U25(0,1) = 3.273
- Estimated errors if the subtree is replaced by its most common leaf, D, covering 16 cases with 1 error:
    16 × U25(1,16) = 2.512
47Pessimistic Pruning (3)
- After pruning:
    physician fee freeze = n:
      adoption of the budget resolution = y: D (151)
      adoption of the budget resolution = u: D (1)
      adoption of the budget resolution = n: D (16/1)
- Estimated errors for the new subtree:
    151 × U25(0,151) + 1 × U25(0,1) + 16 × U25(1,16) = 4.642
- Estimated errors if the new subtree is replaced by its most common leaf, D, covering 168 cases with 1 error:
    168 × U25(1,168) = 2.610
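Here U25(E, N) is the upper confidence limit (at the 25% confidence level) on the error rate of a leaf that covers N cases and misclassifies E of them, so the predicted error count of a leaf is N × U25(E, N). The Python sketch below computes the limit from the exact binomial tail, which reproduces the E = 0 figures above; the 2.512 and 2.610 figures appear to follow Quinlan's (1993) worked example, which uses an approximate limit, so the exact values here come out slightly larger, but the pruning decisions are unchanged.

from math import comb

def binom_cdf(e, n, p):
    # P(at most e errors in n cases when the true error rate is p)
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(e + 1))

def upper_limit(e, n, cf=0.25):
    # U_CF(e, n): the largest error rate p with P(<= e errors | n, p) >= cf, by bisection
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(e, n, mid) >= cf:
            lo = mid
        else:
            hi = mid
    return lo

# Figures from the slides above
print(6*upper_limit(0, 6) + 9*upper_limit(0, 9) + upper_limit(0, 1))         # ~3.27 (slide: 3.273)
print(16*upper_limit(1, 16))                                                 # ~2.55 (slide: 2.512)
print(151*upper_limit(0, 151) + upper_limit(0, 1) + 16*upper_limit(1, 16))   # ~4.68 (slide: 4.642)
print(168*upper_limit(1, 168))                                               # ~2.68 (slide: 2.610)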
48References for Probability Interval Estimation
- A. M. Mood and F. A. Graybill, Introduction to the Theory of Statistics, 2nd edition, New York: McGraw-Hill, 1963, Section 11.6
- M. Abramowitz and I. Stegun, eds., Handbook of Mathematical Functions, New York: Dover, 1964, Sections 26.5 and 26.2
49Estimate of Binomial Probability
- The sample is drawn from a point binomial (Bernoulli) distribution with density f(y; p) = p^y (1 - p)^(1-y), y = 0, 1
- The maximum-likelihood estimate of p is the sample proportion k/n
- Given n and y = k, we need to estimate an interval for p such that the probability that the actual parameter p falls within the interval equals ConfidenceLevel
50Interval Estimation
- The upper limit is the p value that satisfies
- The lower limit is the p value that satisfies
51Incomplete Beta Function
- Cumulative of the beta distribution
- Incomplete Beta function
52Connection between Binomial and Beta Distributions
53Upper Limit
54Lower Limit
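Slides 50 through 54 rest on the connection between the binomial tail probability and the incomplete beta function, which lets the interval limits be read off beta-distribution quantiles. A Python sketch, assuming equal-tailed (Clopper-Pearson style) limits as one common convention:

from scipy.stats import beta

def binomial_interval(k, n, confidence=0.95):
    # Equal-tailed interval for the binomial parameter p after k successes in n trials,
    # via the binomial-beta connection: P(X >= k | n, p) = I_p(k, n - k + 1)
    alpha = 1.0 - confidence
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

print(binomial_interval(3, 20))   # roughly (0.032, 0.379)

# The same connection gives the one-sided U25 limits used on the pruning slides,
# e.g. U25(0, 6) = beta.ppf(0.75, 1, 6), which is about 0.206
print(beta.ppf(0.75, 1, 6))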
55Probably Approximately Correct
- Any hypothesis that is seriously wrong will, with high probability, be found out after a small number of examples, because it will make an incorrect prediction
- Any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong; it must be probably approximately correct
56Two Activation Functions for Neurons
- Step function
- Sign function
57Neurons as Logical Gates
58Nonlinear Regression
59Perceptrons
60Linearly Separable Functions
[Figure: AND and OR are linearly separable; XOR is not]
- A function can be represented by a perceptron if and only if it is linearly separable (see the sketch below)
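A small Python demonstration (not from the slides) of this claim, using the classic perceptron rule with a step activation: training reaches 100% accuracy on AND and OR, while on XOR no weight setting can classify more than three of the four cases correctly.

def train_perceptron(examples, epochs=25, alpha=0.2):
    # Perceptron learning rule with a step activation; w[0] is the bias weight
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for (x1, x2), t in examples:
            x = [1.0, x1, x2]
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            for i in range(3):
                w[i] += alpha * (t - o) * x[i]        # adjust weights by the error
    return w

def accuracy(w, examples):
    correct = 0
    for (x1, x2), t in examples:
        o = 1 if w[0] + w[1]*x1 + w[2]*x2 > 0 else 0
        correct += (o == t)
    return correct / len(examples)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
OR  = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
for name, data in [("AND", AND), ("OR", OR), ("XOR", XOR)]:
    print(name, accuracy(train_perceptron(data), data))   # AND 1.0, OR 1.0, XOR at most 0.75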
61Back Propagation Network
62Changes to the Weight
63Gradient Descent Search