Title: For Wednesday
1. For Wednesday
- Read chapter 22, sections 4-6
- Homework
- Chapter 18, exercise 7
2. Program 4
3. Model Neuron (Linear Threshold Unit)
- Neuron modelled by a unit (j) connected by weights, w_ji, to other units (i)
- Net input to a unit is defined as
- net_j = Σ_i w_ji o_i
- Output of a unit is a threshold function on the net input
- o_j = 1 if net_j > T_j
- o_j = 0 otherwise
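As an illustration, here is a minimal sketch of a linear threshold unit in Python; the weights, inputs, and threshold value are made up for the example.

```python
# Minimal sketch of a linear threshold unit (LTU).
# Weights, inputs, and the threshold below are illustrative values.

def ltu_output(weights, inputs, threshold):
    """Return 1 if the weighted sum of inputs exceeds the threshold, else 0."""
    net = sum(w * o for w, o in zip(weights, inputs))  # net_j = sum_i w_ji * o_i
    return 1 if net > threshold else 0                 # threshold activation

# Example: a unit that fires only when both inputs are active (AND-like behavior)
print(ltu_output([0.6, 0.6], [1, 1], threshold=1.0))   # -> 1
print(ltu_output([0.6, 0.6], [1, 0], threshold=1.0))   # -> 0
```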
4. Multilayer Neural Networks
- Multilayer networks can represent arbitrary functions, but building an effective learning method for such networks was thought to be difficult.
- Generally networks are composed of an input layer, hidden layer, and output layer, and activation feeds forward from input to output.
- Patterns of activation are presented at the inputs and the resulting activation of the outputs is computed.
- The values of the weights determine the function computed.
- A network with one hidden layer with a sufficient number of units can represent any Boolean function.
5. Basic Problem
- The general approach to the learning algorithm is to apply gradient descent.
- However, for the general case, we need to be able to differentiate the function computed by a unit, and the standard threshold function is not differentiable at the threshold.
6. Differentiable Threshold Unit
- Need some sort of nonlinear output function to allow computation of arbitrary functions by multilayer networks (a multilayer network of linear units can still only represent a linear function).
- Solution: Use a nonlinear, differentiable output function such as the sigmoid or logistic function
- o_j = 1 / (1 + e^-(net_j - T_j))
- Can also use other functions such as tanh or a Gaussian.
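A short Python sketch of the sigmoid output function (the sample net-input values are illustrative):

```python
import math

# Sigmoid (logistic) output of a unit as a function of net input and threshold.
def sigmoid_output(net, threshold=0.0):
    """o_j = 1 / (1 + e^-(net_j - T_j)): a smooth, differentiable 0-1 output."""
    return 1.0 / (1.0 + math.exp(-(net - threshold)))

# The output saturates toward 0 or 1 for large negative/positive net input
print(sigmoid_output(-5.0))  # close to 0
print(sigmoid_output(0.0))   # 0.5 at the threshold
print(sigmoid_output(5.0))   # close to 1
```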
7. Error Measure
- Since there are multiple continuous outputs, we can define an overall error measure
- E(W) = 1/2 Σ_{d∈D} Σ_{k∈K} (t_kd - o_kd)^2
- where D is the set of training examples, K is the set of output units, t_kd is the target output for the kth unit given input d, and o_kd is the network output for the kth unit given input d.
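A minimal sketch of this error measure in Python; the target and output values are made-up illustrations:

```python
# Sum-of-squares error over all training examples d and output units k:
# E(W) = 1/2 * sum_d sum_k (t_kd - o_kd)^2
def squared_error(targets, outputs):
    """targets/outputs: per-example lists, one value per output unit."""
    return 0.5 * sum((t - o) ** 2
                     for t_d, o_d in zip(targets, outputs)
                     for t, o in zip(t_d, o_d))

# Two examples, two output units each (illustrative numbers)
print(squared_error([[1, 0], [0, 1]], [[0.9, 0.2], [0.1, 0.7]]))  # ~0.075
```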
8. Gradient Descent
- The derivative of the output of a sigmoid unit with respect to the net input is
- ∂o_j / ∂net_j = o_j (1 - o_j)
- This can be used to derive a learning rule which performs gradient descent in weight space in an attempt to minimize the error function.
- Δw_ji = -η (∂E / ∂w_ji)
9. Backpropagation Learning Rule
- Each weight w_ji is changed by
- Δw_ji = η δ_j o_i
- δ_j = o_j (1 - o_j) (t_j - o_j) if j is an output unit
- δ_j = o_j (1 - o_j) Σ_k δ_k w_kj otherwise
- where η is a constant called the learning rate,
- t_j is the correct output for unit j,
- δ_j is an error measure for unit j.
- First determine the error for the output units, then backpropagate this error layer by layer through the network, changing weights appropriately at each layer.
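A small Python sketch of these update rules; the function names and the way the learning rate is passed are illustrative, not a fixed API:

```python
# Backpropagation delta rules for sigmoid units (illustrative helper names).

def delta_output(o_j, t_j):
    """delta_j = o_j (1 - o_j) (t_j - o_j) for an output unit."""
    return o_j * (1 - o_j) * (t_j - o_j)

def delta_hidden(o_j, downstream_deltas, downstream_weights):
    """delta_j = o_j (1 - o_j) * sum_k delta_k w_kj for a hidden unit."""
    return o_j * (1 - o_j) * sum(d * w for d, w in
                                 zip(downstream_deltas, downstream_weights))

def weight_update(eta, delta_j, o_i):
    """Delta w_ji = eta * delta_j * o_i (added to the current weight)."""
    return eta * delta_j * o_i
```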
10. Backpropagation Learning Algorithm
- Create a three-layer network with N hidden units, fully connect input units to hidden units and hidden units to output units, and initialize with small random weights.
- Until all examples produce the correct output within ε, or the mean-squared error ceases to decrease (or other termination criteria), repeat (a runnable sketch of this loop follows after the list):
- Begin epoch
- For each example in the training set do
- Compute the network output for this example.
- Compute the error between this output and the correct output.
- Backpropagate this error and adjust weights to decrease this error.
- End epoch
- Since continuous outputs only approach 0 or 1 in the limit, must allow for some ε-approximation to learn binary functions.
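Below is a minimal, self-contained sketch of this algorithm in Python, training a 2-2-1 sigmoid network on XOR with per-example updates. The network size, learning rate, epoch count, and random seed are illustrative choices, and (as the next slide notes) convergence is not guaranteed.

```python
import math, random

random.seed(0)
ETA = 0.5        # learning rate (illustrative)
N_HIDDEN = 2     # hidden units (illustrative; small nets can stall in local minima)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Small random weights; w_hidden[j] = [weight from x0, weight from x1, bias]
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(N_HIDDEN)]
w_output = [random.uniform(-0.5, 0.5) for _ in range(N_HIDDEN + 1)]   # last slot = bias

examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]       # XOR

for epoch in range(20000):                   # fixed number of epochs as termination
    for inputs, target in examples:
        # Forward pass: input -> hidden -> output
        h = [sigmoid(sum(w[i] * x for i, x in enumerate(inputs)) + w[2])
             for w in w_hidden]
        o = sigmoid(sum(w_output[j] * h[j] for j in range(N_HIDDEN)) + w_output[-1])

        # Backpropagate: delta for the output unit, then for the hidden units
        delta_o = o * (1 - o) * (target - o)
        delta_h = [h[j] * (1 - h[j]) * delta_o * w_output[j] for j in range(N_HIDDEN)]

        # Weight updates: delta_w = eta * delta_j * o_i
        for j in range(N_HIDDEN):
            w_output[j] += ETA * delta_o * h[j]
        w_output[-1] += ETA * delta_o            # output bias
        for j in range(N_HIDDEN):
            for i, x in enumerate(inputs):
                w_hidden[j][i] += ETA * delta_h[j] * x
            w_hidden[j][2] += ETA * delta_h[j]   # hidden bias

# Check the learned mapping on the training examples
for inputs, target in examples:
    h = [sigmoid(sum(w[i] * x for i, x in enumerate(inputs)) + w[2]) for w in w_hidden]
    o = sigmoid(sum(w_output[j] * h[j] for j in range(N_HIDDEN)) + w_output[-1])
    print(inputs, target, round(o, 2))
```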
11. Comments on Training
- There is no guarantee of convergence; training may oscillate or reach a local minimum.
- However, in practice many large networks can be adequately trained on large amounts of data for realistic problems.
- Many epochs (thousands) may be needed for adequate training; large data sets may require hours or days of CPU time.
- Termination criteria can be
- Fixed number of epochs
- Threshold on training set error
12. Representational Power
- Multilayer sigmoidal networks are very expressive.
- Boolean functions: Any Boolean function can be represented by a two-layer network by simulating a two-layer AND-OR network. But the number of required hidden units can grow exponentially in the number of inputs.
- Continuous functions: Any bounded continuous function can be approximated with arbitrarily small error by a two-layer network. Sigmoid functions provide a set of basis functions from which arbitrary functions can be composed, just as any function can be represented by a sum of sine waves in Fourier analysis.
- Arbitrary functions: Any function can be approximated to arbitrary accuracy by a three-layer network.
13. Sample Learned XOR Network
- [Figure: learned network with inputs X and Y, two hidden units A and B, and one output unit; the learned weights labeled on the links in the original figure were 3.11, 6.96, -7.38, -2.03, -5.24, -3.58, -5.57, -3.6, -5.74]
- Hidden unit A represents ¬(X ∧ Y)
- Hidden unit B represents (X ∨ Y)
- Output O represents A ∧ B
- = ¬(X ∧ Y) ∧ (X ∨ Y)
- = X ⊕ Y
14. Hidden Unit Representations
- Trained hidden units can be seen as newly constructed features that re-represent the examples so that they are linearly separable.
- On many real problems, hidden units can end up representing interesting recognizable features such as vowel detectors, edge detectors, etc.
- However, particularly with many hidden units, they become more distributed and are hard to interpret.
15. Input/Output Coding
- Appropriate coding of inputs and outputs can make the learning problem easier and improve generalization.
- Best to encode each binary feature as a separate input unit, and for multi-valued features include one binary unit per value, rather than trying to encode input information in fewer units using binary coding or continuous values.
16. I/O Coding (cont.)
- Continuous inputs can be handled by a single input unit by scaling them between 0 and 1.
- For disjoint categorization problems, best to have one output unit per category rather than encoding n categories into log n bits. Continuous output values then represent certainty in the various categories. Assign test cases to the category with the highest output.
- Continuous outputs (regression) can also be handled by scaling between 0 and 1.
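A small Python sketch of this one-unit-per-value (one-hot) coding; the feature values and category names are made-up examples:

```python
# One input unit per value of a multi-valued feature (one-hot coding),
# and one output unit per category. Values and categories are illustrative.
def one_hot(value, values):
    """Encode a discrete value as a vector with a single 1."""
    return [1 if value == v else 0 for v in values]

colors = ["red", "green", "blue"]
print(one_hot("green", colors))          # [0, 1, 0]

# Assign a test case to the category whose output unit is most active
outputs = {"noun": 0.2, "verb": 0.7, "adjective": 0.1}
print(max(outputs, key=outputs.get))     # "verb"
```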
17. Neural Net Conclusions
- Learned concepts can be represented by networks of linear threshold units and trained using gradient descent.
- The analogy to the brain and numerous successful applications have generated significant interest.
- Generally much slower to train than other learning methods, but they explore a rich hypothesis space that seems to work well in many domains.
- Potential to model biological and cognitive phenomena and increase our understanding of real neural systems.
- Backprop itself is not very biologically plausible.
18. Natural Language Processing
19. Communication
- Communication for the speaker
- Intention: Deciding why, when, and what information should be transmitted. May require planning and reasoning about agents' goals and beliefs.
- Generation: Translating the information to be communicated into a string of words.
- Synthesis: Output of the string in the desired modality, e.g. text on a screen or speech.
20. Communication (cont.)
- Communication for the hearer
- Perception: Mapping the input modality to a string of words, e.g. optical character recognition or speech recognition.
- Analysis: Determining the information content of the string.
- Syntactic interpretation (parsing): Find the correct parse tree showing the phrase structure.
- Semantic interpretation: Extract the (literal) meaning of the string in some representation, e.g. FOPC.
- Pragmatic interpretation: Consider the effect of overall context on the meaning of the sentence.
- Incorporation: Decide whether or not to believe the content of the string and add it to the KB.
21. Ambiguity
- Natural language sentences are highly ambiguous and must be disambiguated.
- I saw the man on the hill with the telescope.
- I saw the Grand Canyon flying to LA.
- I saw a jet flying to LA.
- Time flies like an arrow.
- Horse flies like a sugar cube.
- Time runners like a coach.
- Time cars like a Porsche.
22. Syntax
- Syntax concerns the proper ordering of words and its effect on meaning.
- The dog bit the boy.
- The boy bit the dog.
- Bit boy the dog the
- Colorless green ideas sleep furiously.
23. Semantics
- Semantics concerns the meaning of words, phrases, and sentences. Generally restricted to literal meaning.
- plant as a photosynthetic organism
- plant as a manufacturing facility
- plant as the act of sowing
24. Pragmatics
- Pragmatics concerns the overall communicative and social context and its effect on interpretation.
- Can you pass the salt?
- Passerby: Does your dog bite?
- Clouseau: No.
- Passerby: (pets dog) Chomp!
- I thought you said your dog didn't bite!!
- Clouseau: That, sir, is not my dog!
25. Modular Processing
- [Figure: processing pipeline: sound waves → speech recognition (acoustic/phonetic) → words → parsing (syntax) → parse trees → semantics → literal meaning → pragmatics → meaning]
26. Examples
- Phonetics
- grey twine vs. great wine
- youth in Asia vs. euthanasia
- yawanna → do you want to
- Syntax
- I ate spaghetti with a fork.
- I ate spaghetti with meatballs.
27. More Examples
- Semantics
- I put the plant in the window.
- Ford put the plant in Mexico.
- The dog is in the pen.
- The ink is in the pen.
- Pragmatics
- The ham sandwich wants another beer.
- John thinks vanilla.
28. Formal Grammars
- A grammar is a set of production rules which generates a set of strings (a language) by rewriting the top symbol S.
- Nonterminal symbols are intermediate results that are not contained in strings of the language.
- S → NP VP
- NP → Det N
- VP → V NP
29. Formal Grammars (cont.)
- Terminal symbols are the final symbols (words) that compose the strings in the language.
- Production rules for generating words from part-of-speech categories constitute the lexicon.
- N → boy
- V → eat
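As an illustration, a grammar and lexicon like this can be written down directly, e.g. as a Python dictionary; the representation below and the added Det → the entry are illustrative assumptions, not part of the slides:

```python
import random

# Production rules as a dictionary: nonterminal -> list of right-hand sides.
# The first five entries mirror the small example rules from the slides.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "N":   [["boy"]],          # lexicon: part-of-speech category -> word
    "V":   [["eat"]],
    "Det": [["the"]],          # assumed determiner entry so strings are derivable
}

def generate(symbol):
    """Randomly rewrite a symbol until only terminal words remain."""
    if symbol not in GRAMMAR:              # terminal: a word of the language
        return [symbol]
    rhs = random.choice(GRAMMAR[symbol])
    return [word for s in rhs for word in generate(s)]

print(" ".join(generate("S")))             # e.g. "the boy eat the boy"
```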
30. Context-Free Grammars
- A context-free grammar only has productions with a single symbol on the left-hand side.
- CFG: S → NP VP
- NP → Det N
- VP → V NP
- not CFG: A B → C
- B C → F G
31. Simplified English Grammar
- S → NP VP;  S → VP
- NP → Det Adj N;  NP → ProN;  NP → PName
- VP → V;  VP → V NP;  VP → VP PP
- PP → Prep NP
- Adj → ε;  Adj → Adj Adj
- Lexicon
- ProN → I;  ProN → you;  ProN → he;  ProN → she
- Name → John;  Name → Mary
- Adj → big;  Adj → little;  Adj → blue;  Adj → red
- Det → the;  Det → a;  Det → an
- N → man;  N → telescope;  N → hill;  N → saw
- Prep → with;  Prep → for;  Prep → of;  Prep → in
- V → hit;  V → took;  V → saw;  V → likes
32. Parse Trees
- A parse tree shows the derivation of a sentence in the language from the start symbol to the terminal symbols.
- If a given sentence has more than one possible derivation (parse tree), it is said to be syntactically ambiguous.
33-34. (Figure-only slides; no transcript)
35. Syntactic Parsing
- Given a string of words, determine if it is grammatical, i.e. if it can be derived from a particular grammar.
- The derivation itself may also be of interest.
- Normally want to determine all possible parse trees and then use semantics and pragmatics to eliminate spurious parses and build a semantic representation.
36. Parsing Complexity
- Problem: Many sentences have many parses.
- An English sentence with n prepositional phrases at the end has at least 2^n parses.
- I saw the man on the hill with a telescope on Tuesday in Austin...
- The actual number of parses is given by the Catalan numbers:
- 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796...
37. Parsing Algorithms
- Top-Down: Search the space of possible derivations of S (e.g. depth-first) for one that matches the input sentence (a small recursive-descent sketch follows after the trace below).
- I saw the man.
- S → NP VP
- NP → Det Adj N
- Det → the
- Det → a
- Det → an
- NP → ProN
- ProN → I
- VP → V NP;  V → hit;  V → took;  V → saw
- NP → Det Adj N;  Det → the
- Adj → ε;  N → man
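Below is a minimal recursive-descent (top-down, depth-first) recognizer in Python over a reduced version of the slide grammar; left-recursive rules such as VP → VP PP and Adj → Adj Adj are omitted so the naive search terminates, and all helper names are illustrative:

```python
# Minimal top-down recognizer over a reduced fragment of the slide grammar.
GRAMMAR = {
    "S":    [["NP", "VP"]],
    "NP":   [["Det", "Adj", "N"], ["ProN"]],
    "VP":   [["V", "NP"], ["V"]],
    "Adj":  [[]],                                     # Adj -> epsilon
    "ProN": [["I"], ["you"]],
    "Det":  [["the"], ["a"], ["an"]],
    "N":    [["man"], ["telescope"], ["hill"], ["saw"]],
    "V":    [["hit"], ["took"], ["saw"], ["likes"]],
}

def parse(symbols, words):
    """Return True if the symbol sequence can derive exactly the word list."""
    if not symbols:
        return not words                              # both exhausted -> success
    first, rest = symbols[0], symbols[1:]
    if first not in GRAMMAR:                          # terminal word
        return bool(words) and words[0] == first and parse(rest, words[1:])
    # Nonterminal: try each production depth-first (top-down search)
    return any(parse(rhs + rest, words) for rhs in GRAMMAR[first])

print(parse(["S"], "I saw the man".split()))          # True
print(parse(["S"], "man the saw I".split()))          # False
```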
38. Parsing Algorithms (cont.)
- Bottom-Up: Search upward from the words, finding larger and larger phrases until a sentence is found.
- I saw the man.
- ProN saw the man   (ProN → I)
- NP saw the man   (NP → ProN)
- NP N the man   (N → saw) (dead end)
- NP V the man   (V → saw)
- NP V Det man   (Det → the)
- NP V Det Adj man   (Adj → ε)
- NP V Det Adj N   (N → man)
- NP V NP   (NP → Det Adj N)
- NP VP   (VP → V NP)
- S   (S → NP VP)
39. Bottom-Up Parsing Algorithm
- function BOTTOM-UP-PARSE(words, grammar) returns a parse tree
- forest ← words
- loop do
-   if LENGTH(forest) = 1 and CATEGORY(forest[1]) = START(grammar) then
-     return forest[1]
-   else
-     i ← choose from 1...LENGTH(forest)
-     rule ← choose from RULES(grammar)
-     n ← LENGTH(RULE-RHS(rule))
-     subsequence ← SUBSEQUENCE(forest, i, i+n-1)
-     if MATCH(subsequence, RULE-RHS(rule)) then
-       forest[i...i+n-1] ← MAKE-NODE(RULE-LHS(rule), subsequence)
-     else fail
- end
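Below is a minimal runnable Python sketch in the spirit of this algorithm; instead of the nondeterministic choose steps it greedily scans the forest for any subsequence matching a rule's right-hand side, so it illustrates the bottom-up reduction idea rather than the full search. The rule set is a small fragment of the slide grammar, and the tuple-based tree representation and helper names are illustrative choices:

```python
# Minimal bottom-up reduction sketch (greedy, not a full nondeterministic search).
RULES = [
    ("S",    ["NP", "VP"]),
    ("VP",   ["V", "NP"]),
    ("NP",   ["ProN"]),
    ("NP",   ["Det", "N"]),
    ("ProN", ["I"]),         # lexicon: category -> word
    ("V",    ["saw"]),
    ("Det",  ["the"]),
    ("N",    ["man"]),
]

def category(node):
    """Category of a tree node, or the word itself for a bare leaf."""
    return node[0] if isinstance(node, tuple) else node

def bottom_up_parse(words):
    forest = list(words)                     # the forest starts as the word string
    while True:
        if len(forest) == 1 and category(forest[0]) == "S":
            return forest[0]                 # single tree spanning the sentence
        for lhs, rhs in RULES:               # find any reducible subsequence
            n = len(rhs)
            for i in range(len(forest) - n + 1):
                if [category(x) for x in forest[i:i + n]] == rhs:
                    forest[i:i + n] = [(lhs, forest[i:i + n])]
                    break
            else:
                continue
            break                            # restart the rule scan after a reduction
        else:
            return None                      # no rule applies: not in the language

print(bottom_up_parse("I saw the man".split()))
```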