Title: For Wednesday
1. For Wednesday
- Read chapter 22, sections 4-6
- Homework
- Chapter 18, exercise 7
2. Program 4
3. Model Neuron (Linear Threshold Unit)
- Neuron modelled by a unit (j) connected by weights, w_ji, to other units (i)
- Net input to a unit is defined as
- net_j = Σ_i w_ji o_i
- Output of a unit is a threshold function on the net input
- o_j = 1 if net_j > T_j
- o_j = 0 otherwise
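As an illustration, here is a minimal sketch of a linear threshold unit in Python; the weights, inputs, and threshold value are made up for the example.

```python
# Minimal sketch of a linear threshold unit (LTU).
# Weights, inputs, and the threshold below are illustrative values.

def ltu_output(weights, inputs, threshold):
    """Return 1 if the weighted sum of inputs exceeds the threshold, else 0."""
    net = sum(w * o for w, o in zip(weights, inputs))  # net_j = sum_i w_ji * o_i
    return 1 if net > threshold else 0                 # threshold activation

# Example: a unit that fires only when both inputs are active (AND-like behavior)
print(ltu_output([0.6, 0.6], [1, 1], threshold=1.0))   # -> 1
print(ltu_output([0.6, 0.6], [1, 0], threshold=1.0))   # -> 0
```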
4. Multilayer Neural Networks
- Multilayer networks can represent arbitrary functions, but building an effective learning method for such networks was thought to be difficult.
- Generally networks are composed of an input layer, hidden layer, and output layer, and activation feeds forward from input to output.
- Patterns of activation are presented at the inputs and the resulting activation of the outputs is computed.
- The values of the weights determine the function computed.
- A network with one hidden layer with a sufficient number of units can represent any Boolean function.
5. Basic Problem
- The general approach to the learning algorithm is to apply gradient descent.
- However, for the general case, we need to be able to differentiate the function computed by a unit, and the standard threshold function is not differentiable at the threshold.
6. Differentiable Threshold Unit
- Need some sort of nonlinear output function to allow computation of arbitrary functions by multilayer networks (a multilayer network of linear units can still only represent a linear function).
- Solution: Use a nonlinear, differentiable output function such as the sigmoid or logistic function
- o_j = 1 / (1 + e^-(net_j - T_j))
- Can also use other functions such as tanh or a Gaussian.
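A short Python sketch of the sigmoid output function (the sample net-input values are illustrative):

```python
import math

# Sigmoid (logistic) output of a unit as a function of net input and threshold.
def sigmoid_output(net, threshold=0.0):
    """o_j = 1 / (1 + e^-(net_j - T_j)): a smooth, differentiable 0-1 output."""
    return 1.0 / (1.0 + math.exp(-(net - threshold)))

# The output saturates toward 0 or 1 for large negative/positive net input
print(sigmoid_output(-5.0))  # close to 0
print(sigmoid_output(0.0))   # 0.5 at the threshold
print(sigmoid_output(5.0))   # close to 1
```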
7. Error Measure
- Since there are multiple continuous outputs, we can define an overall error measure
- E(W) = 1/2 Σ_{d∈D} Σ_{k∈K} (t_kd - o_kd)^2
- where D is the set of training examples, K is the set of output units, t_kd is the target output for the kth unit given input d, and o_kd is the network output for the kth unit given input d.
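A minimal sketch of this error measure in Python; the target and output values are made-up illustrations:

```python
# Sum-of-squares error over all training examples d and output units k:
# E(W) = 1/2 * sum_d sum_k (t_kd - o_kd)^2
def squared_error(targets, outputs):
    """targets/outputs: per-example lists, one value per output unit."""
    return 0.5 * sum((t - o) ** 2
                     for t_d, o_d in zip(targets, outputs)
                     for t, o in zip(t_d, o_d))

# Two examples, two output units each (illustrative numbers)
print(squared_error([[1, 0], [0, 1]], [[0.9, 0.2], [0.1, 0.7]]))  # ~0.075
```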
8. Gradient Descent
- The derivative of the output of a sigmoid unit with respect to the net input is
- ∂o_j / ∂net_j = o_j (1 - o_j)
- This can be used to derive a learning rule which performs gradient descent in weight space in an attempt to minimize the error function.
- Δw_ji = -η (∂E / ∂w_ji)
9. Backpropagation Learning Rule
- Each weight w_ji is changed by
- Δw_ji = η δ_j o_i
- δ_j = o_j (1 - o_j) (t_j - o_j) if j is an output unit
- δ_j = o_j (1 - o_j) Σ_k δ_k w_kj otherwise
- where η is a constant called the learning rate,
- t_j is the correct output for unit j,
- δ_j is an error measure for unit j.
- First determine the error for the output units, then backpropagate this error layer by layer through the network, changing weights appropriately at each layer.
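A small Python sketch of these update rules; the function names and the way the learning rate is passed are illustrative, not a fixed API:

```python
# Backpropagation delta rules for sigmoid units (illustrative helper names).

def delta_output(o_j, t_j):
    """delta_j = o_j (1 - o_j) (t_j - o_j) for an output unit."""
    return o_j * (1 - o_j) * (t_j - o_j)

def delta_hidden(o_j, downstream_deltas, downstream_weights):
    """delta_j = o_j (1 - o_j) * sum_k delta_k w_kj for a hidden unit."""
    return o_j * (1 - o_j) * sum(d * w for d, w in
                                 zip(downstream_deltas, downstream_weights))

def weight_update(eta, delta_j, o_i):
    """Delta w_ji = eta * delta_j * o_i (added to the current weight)."""
    return eta * delta_j * o_i
```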
10. Backpropagation Learning Algorithm
- Create a three-layer network with N hidden units, fully connect input units to hidden units and hidden units to output units, and initialize with small random weights.
- Until all examples produce the correct output within ε, or the mean-squared error ceases to decrease (or other termination criteria), repeat (a runnable sketch of this loop follows after the list):
- Begin epoch
- For each example in the training set do
- Compute the network output for this example.
- Compute the error between this output and the correct output.
- Backpropagate this error and adjust weights to decrease this error.
- End epoch
- Since continuous outputs only approach 0 or 1 in the limit, must allow for some ε-approximation to learn binary functions.
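Below is a minimal, self-contained sketch of this algorithm in Python, training a 2-2-1 sigmoid network on XOR with per-example updates. The network size, learning rate, epoch count, and random seed are illustrative choices, and (as the next slide notes) convergence is not guaranteed.

```python
import math, random

random.seed(0)
ETA = 0.5        # learning rate (illustrative)
N_HIDDEN = 2     # hidden units (illustrative; small nets can stall in local minima)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Small random weights; w_hidden[j] = [weight from x0, weight from x1, bias]
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(N_HIDDEN)]
w_output = [random.uniform(-0.5, 0.5) for _ in range(N_HIDDEN + 1)]   # last slot = bias

examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]       # XOR

for epoch in range(20000):                   # fixed number of epochs as termination
    for inputs, target in examples:
        # Forward pass: input -> hidden -> output
        h = [sigmoid(sum(w[i] * x for i, x in enumerate(inputs)) + w[2])
             for w in w_hidden]
        o = sigmoid(sum(w_output[j] * h[j] for j in range(N_HIDDEN)) + w_output[-1])

        # Backpropagate: delta for the output unit, then for the hidden units
        delta_o = o * (1 - o) * (target - o)
        delta_h = [h[j] * (1 - h[j]) * delta_o * w_output[j] for j in range(N_HIDDEN)]

        # Weight updates: delta_w = eta * delta_j * o_i
        for j in range(N_HIDDEN):
            w_output[j] += ETA * delta_o * h[j]
        w_output[-1] += ETA * delta_o            # output bias
        for j in range(N_HIDDEN):
            for i, x in enumerate(inputs):
                w_hidden[j][i] += ETA * delta_h[j] * x
            w_hidden[j][2] += ETA * delta_h[j]   # hidden bias

# Check the learned mapping on the training examples
for inputs, target in examples:
    h = [sigmoid(sum(w[i] * x for i, x in enumerate(inputs)) + w[2]) for w in w_hidden]
    o = sigmoid(sum(w_output[j] * h[j] for j in range(N_HIDDEN)) + w_output[-1])
    print(inputs, target, round(o, 2))
```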
11. Comments on Training
- There is no guarantee of convergence; training may oscillate or reach a local minimum.
- However, in practice many large networks can be adequately trained on large amounts of data for realistic problems.
- Many epochs (thousands) may be needed for adequate training; large data sets may require hours or days of CPU time.
- Termination criteria can be
- Fixed number of epochs
- Threshold on training set error
12. Representational Power
- Multilayer sigmoidal networks are very expressive.
- Boolean functions: Any Boolean function can be represented by a two-layer network by simulating a two-layer AND-OR network. But the number of required hidden units can grow exponentially in the number of inputs.
- Continuous functions: Any bounded continuous function can be approximated with arbitrarily small error by a two-layer network. Sigmoid functions provide a set of basis functions from which arbitrary functions can be composed, just as any function can be represented by a sum of sine waves in Fourier analysis.
- Arbitrary functions: Any function can be approximated to arbitrary accuracy by a three-layer network.
13. Sample Learned XOR Network
- [Figure: learned network with inputs X and Y, two hidden units A and B, and one output unit; the learned weights labeled on the links in the original figure were 3.11, 6.96, -7.38, -2.03, -5.24, -3.58, -5.57, -3.6, -5.74]
- Hidden unit A represents ¬(X ∧ Y)
- Hidden unit B represents (X ∨ Y)
- Output O represents A ∧ B
- = ¬(X ∧ Y) ∧ (X ∨ Y)
- = X ⊕ Y
14. Hidden Unit Representations
- Trained hidden units can be seen as newly constructed features that re-represent the examples so that they are linearly separable.
- On many real problems, hidden units can end up representing interesting recognizable features such as vowel detectors, edge detectors, etc.
- However, particularly with many hidden units, they become more distributed and are hard to interpret.
15. Input/Output Coding
- Appropriate coding of inputs and outputs can make the learning problem easier and improve generalization.
- Best to encode each binary feature as a separate input unit, and for multi-valued features include one binary unit per value, rather than trying to encode input information in fewer units using binary coding or continuous values.
16. I/O Coding (cont.)
- Continuous inputs can be handled by a single input unit by scaling them between 0 and 1.
- For disjoint categorization problems, best to have one output unit per category rather than encoding n categories into log n bits. Continuous output values then represent certainty in the various categories. Assign test cases to the category with the highest output.
- Continuous outputs (regression) can also be handled by scaling between 0 and 1.
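A small Python sketch of this one-unit-per-value (one-hot) coding; the feature values and category names are made-up examples:

```python
# One input unit per value of a multi-valued feature (one-hot coding),
# and one output unit per category. Values and categories are illustrative.
def one_hot(value, values):
    """Encode a discrete value as a vector with a single 1."""
    return [1 if value == v else 0 for v in values]

colors = ["red", "green", "blue"]
print(one_hot("green", colors))          # [0, 1, 0]

# Assign a test case to the category whose output unit is most active
outputs = {"noun": 0.2, "verb": 0.7, "adjective": 0.1}
print(max(outputs, key=outputs.get))     # "verb"
```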
17. Neural Net Conclusions
- Learned concepts can be represented by networks of linear threshold units and trained using gradient descent.
- The analogy to the brain and numerous successful applications have generated significant interest.
- Generally much slower to train than other learning methods, but they explore a rich hypothesis space that seems to work well in many domains.
- Potential to model biological and cognitive phenomena and increase our understanding of real neural systems.
- Backprop itself is not very biologically plausible.
18. Natural Language Processing
19. Communication
- Communication for the speaker
- Intention: Deciding why, when, and what information should be transmitted. May require planning and reasoning about agents' goals and beliefs.
- Generation: Translating the information to be communicated into a string of words.
- Synthesis: Output of the string in the desired modality, e.g. text on a screen or speech.
20. Communication (cont.)
- Communication for the hearer
- Perception: Mapping the input modality to a string of words, e.g. optical character recognition or speech recognition.
- Analysis: Determining the information content of the string.
- Syntactic interpretation (parsing): Find the correct parse tree showing the phrase structure.
- Semantic interpretation: Extract the (literal) meaning of the string in some representation, e.g. FOPC.
- Pragmatic interpretation: Consider the effect of overall context on the meaning of the sentence.
- Incorporation: Decide whether or not to believe the content of the string and add it to the KB.
21. Ambiguity
- Natural language sentences are highly ambiguous and must be disambiguated.
- I saw the man on the hill with the telescope.
- I saw the Grand Canyon flying to LA.
- I saw a jet flying to LA.
- Time flies like an arrow.
- Horse flies like a sugar cube.
- Time runners like a coach.
- Time cars like a Porsche.
22. Syntax
- Syntax concerns the proper ordering of words and its effect on meaning.
- The dog bit the boy.
- The boy bit the dog.
- Bit boy the dog the
- Colorless green ideas sleep furiously.
23. Semantics
- Semantics concerns the meaning of words, phrases, and sentences. Generally restricted to literal meaning.
- plant as a photosynthetic organism
- plant as a manufacturing facility
- plant as the act of sowing
24. Pragmatics
- Pragmatics concerns the overall communicative and social context and its effect on interpretation.
- Can you pass the salt?
- Passerby: Does your dog bite?
- Clouseau: No.
- Passerby: (pets dog) Chomp!
- I thought you said your dog didn't bite!!
- Clouseau: That, sir, is not my dog!
25. Modular Processing
- [Figure: processing pipeline: sound waves → speech recognition (acoustic/phonetic) → words → parsing (syntax) → parse trees → semantics → literal meaning → pragmatics → meaning]
26. Examples
- Phonetics
- grey twine vs. great wine
- youth in Asia vs. euthanasia
- yawanna → do you want to
- Syntax
- I ate spaghetti with a fork.
- I ate spaghetti with meatballs.
27. More Examples
- Semantics
- I put the plant in the window.
- Ford put the plant in Mexico.
- The dog is in the pen.
- The ink is in the pen.
- Pragmatics
- The ham sandwich wants another beer.
- John thinks vanilla.
28. Formal Grammars
- A grammar is a set of production rules which generates a set of strings (a language) by rewriting the top symbol S.
- Nonterminal symbols are intermediate results that are not contained in strings of the language.
- S → NP VP
- NP → Det N
- VP → V NP
29. Formal Grammars (cont.)
- Terminal symbols are the final symbols (words) that compose the strings in the language.
- Production rules for generating words from part-of-speech categories constitute the lexicon.
- N → boy
- V → eat
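As an illustration, a grammar and lexicon like this can be written down directly, e.g. as a Python dictionary; the representation below and the added Det → the entry are illustrative assumptions, not part of the slides:

```python
import random

# Production rules as a dictionary: nonterminal -> list of right-hand sides.
# The first five entries mirror the small example rules from the slides.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "N":   [["boy"]],          # lexicon: part-of-speech category -> word
    "V":   [["eat"]],
    "Det": [["the"]],          # assumed determiner entry so strings are derivable
}

def generate(symbol):
    """Randomly rewrite a symbol until only terminal words remain."""
    if symbol not in GRAMMAR:              # terminal: a word of the language
        return [symbol]
    rhs = random.choice(GRAMMAR[symbol])
    return [word for s in rhs for word in generate(s)]

print(" ".join(generate("S")))             # e.g. "the boy eat the boy"
```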
30. Context-Free Grammars
- A context-free grammar only has productions with a single symbol on the left-hand side.
- CFG: S → NP VP
- NP → Det N
- VP → V NP
- not CFG: A B → C
- B C → F G
31. Simplified English Grammar
- S → NP VP;  S → VP
- NP → Det Adj N;  NP → ProN;  NP → PName
- VP → V;  VP → V NP;  VP → VP PP
- PP → Prep NP
- Adj → ε;  Adj → Adj Adj
- Lexicon
- ProN → I;  ProN → you;  ProN → he;  ProN → she
- Name → John;  Name → Mary
- Adj → big;  Adj → little;  Adj → blue;  Adj → red
- Det → the;  Det → a;  Det → an
- N → man;  N → telescope;  N → hill;  N → saw
- Prep → with;  Prep → for;  Prep → of;  Prep → in
- V → hit;  V → took;  V → saw;  V → likes
32. Parse Trees
- A parse tree shows the derivation of a sentence in the language from the start symbol to the terminal symbols.
- If a given sentence has more than one possible derivation (parse tree), it is said to be syntactically ambiguous.
33-34. (Figure-only slides; no transcript)
35. Syntactic Parsing
- Given a string of words, determine if it is grammatical, i.e. if it can be derived from a particular grammar.
- The derivation itself may also be of interest.
- Normally want to determine all possible parse trees and then use semantics and pragmatics to eliminate spurious parses and build a semantic representation.
36. Parsing Complexity
- Problem: Many sentences have many parses.
- An English sentence with n prepositional phrases at the end has at least 2^n parses.
- I saw the man on the hill with a telescope on Tuesday in Austin...
- The actual number of parses is given by the Catalan numbers:
- 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796...
37. Parsing Algorithms
- Top-Down: Search the space of possible derivations of S (e.g. depth-first) for one that matches the input sentence (a small recursive-descent sketch follows after the trace below).
- I saw the man.
- S → NP VP
- NP → Det Adj N
- Det → the
- Det → a
- Det → an
- NP → ProN
- ProN → I
- VP → V NP;  V → hit;  V → took;  V → saw
- NP → Det Adj N;  Det → the
- Adj → ε;  N → man
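Below is a minimal recursive-descent (top-down, depth-first) recognizer in Python over a reduced version of the slide grammar; left-recursive rules such as VP → VP PP and Adj → Adj Adj are omitted so the naive search terminates, and all helper names are illustrative:

```python
# Minimal top-down recognizer over a reduced fragment of the slide grammar.
GRAMMAR = {
    "S":    [["NP", "VP"]],
    "NP":   [["Det", "Adj", "N"], ["ProN"]],
    "VP":   [["V", "NP"], ["V"]],
    "Adj":  [[]],                                     # Adj -> epsilon
    "ProN": [["I"], ["you"]],
    "Det":  [["the"], ["a"], ["an"]],
    "N":    [["man"], ["telescope"], ["hill"], ["saw"]],
    "V":    [["hit"], ["took"], ["saw"], ["likes"]],
}

def parse(symbols, words):
    """Return True if the symbol sequence can derive exactly the word list."""
    if not symbols:
        return not words                              # both exhausted -> success
    first, rest = symbols[0], symbols[1:]
    if first not in GRAMMAR:                          # terminal word
        return bool(words) and words[0] == first and parse(rest, words[1:])
    # Nonterminal: try each production depth-first (top-down search)
    return any(parse(rhs + rest, words) for rhs in GRAMMAR[first])

print(parse(["S"], "I saw the man".split()))          # True
print(parse(["S"], "man the saw I".split()))          # False
```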
38. Parsing Algorithms (cont.)
- Bottom-Up: Search upward from the words, finding larger and larger phrases until a sentence is found.
- I saw the man.
- ProN saw the man   (ProN → I)
- NP saw the man   (NP → ProN)
- NP N the man   (N → saw) (dead end)
- NP V the man   (V → saw)
- NP V Det man   (Det → the)
- NP V Det Adj man   (Adj → ε)
- NP V Det Adj N   (N → man)
- NP V NP   (NP → Det Adj N)
- NP VP   (VP → V NP)
- S   (S → NP VP)
39. Bottom-Up Parsing Algorithm
- function BOTTOM-UP-PARSE(words, grammar) returns a parse tree
- forest ← words
- loop do
-   if LENGTH(forest) = 1 and CATEGORY(forest[1]) = START(grammar) then
-     return forest[1]
-   else
-     i ← choose from 1...LENGTH(forest)
-     rule ← choose from RULES(grammar)
-     n ← LENGTH(RULE-RHS(rule))
-     subsequence ← SUBSEQUENCE(forest, i, i+n-1)
-     if MATCH(subsequence, RULE-RHS(rule)) then
-       forest[i...i+n-1] ← MAKE-NODE(RULE-LHS(rule), subsequence)
-     else fail
- end
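Below is a minimal runnable Python sketch in the spirit of this algorithm; instead of the nondeterministic choose steps it greedily scans the forest for any subsequence matching a rule's right-hand side, so it illustrates the bottom-up reduction idea rather than the full search. The rule set is a small fragment of the slide grammar, and the tuple-based tree representation and helper names are illustrative choices:

```python
# Minimal bottom-up reduction sketch (greedy, not a full nondeterministic search).
RULES = [
    ("S",    ["NP", "VP"]),
    ("VP",   ["V", "NP"]),
    ("NP",   ["ProN"]),
    ("NP",   ["Det", "N"]),
    ("ProN", ["I"]),         # lexicon: category -> word
    ("V",    ["saw"]),
    ("Det",  ["the"]),
    ("N",    ["man"]),
]

def category(node):
    """Category of a tree node, or the word itself for a bare leaf."""
    return node[0] if isinstance(node, tuple) else node

def bottom_up_parse(words):
    forest = list(words)                     # the forest starts as the word string
    while True:
        if len(forest) == 1 and category(forest[0]) == "S":
            return forest[0]                 # single tree spanning the sentence
        for lhs, rhs in RULES:               # find any reducible subsequence
            n = len(rhs)
            for i in range(len(forest) - n + 1):
                if [category(x) for x in forest[i:i + n]] == rhs:
                    forest[i:i + n] = [(lhs, forest[i:i + n])]
                    break
            else:
                continue
            break                            # restart the rule scan after a reduction
        else:
            return None                      # no rule applies: not in the language

print(bottom_up_parse("I saw the man".split()))
```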