Lambert Schomaker - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Lambert Schomaker


1
KI2 5 Grammar inference
  • Lambert Schomaker

Kunstmatige Intelligentie / RuG
2
Grammar inference (GI)
  • methods aimed at uncovering the grammar which
    underlies an observed sequence of tokens
  • Two variants:
  • explicit, formal GI: deterministic token generators
  • implicit, statistical GI: stochastic token generators

3
Grammar inference
  • AABBCCAA..(?).. what's next?
  • ABA → 1A 1B 1A
  • AABBAA → 2A 2B 2A
  • or
  • → AAB (mirror-symmetric)
  • → (2A B) (mirrored)
  • repetition, mirroring, insertion, substitution
    (see the sketch below)
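
A minimal Python sketch (ours, not part of the slides) of the two simplest hypotheses above, repetition and mirror symmetry; the function names and examples are made up for illustration:

from itertools import groupby

def run_lengths(s):
    """Repetition structure: 'AABBAA' -> [(2, 'A'), (2, 'B'), (2, 'A')]."""
    return [(len(list(g)), t) for t, g in groupby(s)]

def is_mirror(s):
    """Mirror symmetry: the token string reads the same backwards."""
    return s == s[::-1]

print(run_lengths("AABBAA"))   # [(2, 'A'), (2, 'B'), (2, 'A')]
print(is_mirror("AABBAA"))     # True
print(is_mirror("AABBCCAA"))   # False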

4
Strings of tokens
  • DNA: ACTGAGGACCTGAC
  • output of speech recognizers
  • words from an unknown language
  • tokenized patterns in the real world

5
Strings of tokens
  • DNA: ACTGAGGACCTGAC
  • output of speech recognizers
  • words from an unknown language
  • tokenized patterns in the real world

A B A
6
Strings of tokens
  • DNA: ACTGAGGACCTGAC
  • output of speech recognizers
  • words from an unknown language
  • tokenized patterns in the real world

A B A → Symm(B, A)
7
GI
  • induction of structural patterns from
    observed data
  • representation by a formal grammar
  • versus
  • emulating the underlying grammar without making
    the rules explicit (NN, HMM)

8
GI, the engine
(diagram: Data → Grammar Induction → Grammatical rules)
aaabbb  →  (seq (repeat 3 a) (repeat 3 b))
ab      →  (seq a b)
abccba  →  (symmetry (repeat 2 c) (seq a b))
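
A toy sketch (ours, not the engine behind the slide) for the first two examples; it only induces "repeat" and "seq" rules, leaving out the symmetry rule of the third example for brevity:

from itertools import groupby

def induce(s):
    """'aaabbb' -> '(seq (repeat 3 a) (repeat 3 b))', 'ab' -> '(seq a b)'."""
    parts = []
    for tok, grp in groupby(s):
        n = len(list(grp))
        parts.append(tok if n == 1 else f"(repeat {n} {tok})")
    return parts[0] if len(parts) == 1 else "(seq " + " ".join(parts) + ")"

print(induce("aaabbb"))   # (seq (repeat 3 a) (repeat 3 b))
print(induce("ab"))       # (seq a b)
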
9
The hypothesis behind GI
(diagram: a generator process with grammar G0 produces the data, e.g. aaabbb, ab, abccba;
grammar induction infers a grammar G from that data)
Find G ≈ G0
10
The hypothesis behind GI
(diagram: generator process G0 → data → grammar induction → G)
Find G ≈ G0
It is not claimed that G0 actually exists.
11
Learning
  • Until now it was implicitly assumed that the data
    consists of positive examples
  • A very large amount of data is needed to induce
    an underlying grammar
  • It is difficult to find a good approximation to
    G0 if there are no negative examples
  • e.g. aaxybb does NOT belong to the grammar

12
Learning
Convergence G ≈ G0 is assumed for infinite N
(diagram: the generator process G0 produces data; grammar induction refines its hypothesis
with every new sample: sample1 → G1, sample2 → G12, sample3 → G123, ..., sampleN → G)
13
Learning
(Convergence G ≈ G0 is assumed for infinite N.)
More realistic: a PAC, "probably approximately correct" G
(diagram: the same incremental refinement, sample1 → G1, sample2 → G12, ..., sampleN → G)
14
PAC GI
(Venn diagram: L(G0), the language generated by G0, overlapping L(G), the language explained by G)
P[ p(L(G0) ⊕ L(G)) < ε ] > (1 - δ)
15
PAC GI
(Venn diagram: L(G0), the language generated by G0, overlapping L(G), the language explained by G)
The probability that the probability of finding elements in L(G0) xor L(G) is smaller
than ε will be larger than 1 - δ:
P[ p(L(G0) ⊕ L(G)) < ε ] > (1 - δ)
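
As a numerical illustration (ours, not from the slides): the inner quantity p(L(G0) ⊕ L(G)) can be estimated by sampling strings and counting how often exactly one of the two languages accepts them. The example languages and the sampling distribution below are invented for the sketch:

import random

def sym_diff_prob(in_L0, in_L, draw, n=20000):
    """Estimate p(L(G0) xor L(G)): fraction of sampled strings accepted by exactly one language."""
    return sum(in_L0(s) != in_L(s) for s in (draw() for _ in range(n))) / n

in_L0 = lambda s: s == s[::-1]                      # G0: mirror-symmetric strings over {a, b}
in_L  = lambda s: s == s[::-1] and len(s) % 2 == 0  # hypothesis G: only the even-length ones
draw  = lambda: "".join(random.choice("ab") for _ in range(random.randint(1, 6)))

eps = 0.2
print(sym_diff_prob(in_L0, in_L, draw) < eps)       # is the error below ε for this sampler?
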
16
Example
  • S = {aa, aba, abba, abbba}

(automaton diagram covering the first sample string, aa)
17
Example
  • S = {aa, aba, abba, abbba}

(automaton diagram extended to also cover aba)
18
Example
  • S = {aa, aba, abba, abbba}

(automaton diagram extended to also cover abba)
19
Example
  • S = {aa, aba, abba, abbba}

(automaton diagram extended to also cover abbba)
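
One way to realize this incremental construction in code is a prefix-tree acceptor; this is our sketch, not necessarily the exact automaton drawn on the slides. Merging the b-states afterwards would generalize it to a b* a:

S = ["aa", "aba", "abba", "abbba"]

def prefix_tree(samples):
    """States are prefixes of the samples; each full sample string is an accepting state."""
    edges, accepting = {}, set(samples)
    for s in samples:
        for i, tok in enumerate(s):
            edges[(s[:i], tok)] = s[:i + 1]
    return edges, accepting

def accepts(edges, accepting, s):
    state = ""
    for tok in s:
        if (state, tok) not in edges:
            return False
        state = edges[(state, tok)]
    return state in accepting

edges, acc = prefix_tree(S)
print(accepts(edges, acc, "abba"))    # True: in the sample
print(accepts(edges, acc, "abbbba"))  # False: the tree alone does not generalize to a b* a
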
20
Many GI approaches are known (Dupont, 1997)
21
Second group: Grammar Emulation
  • Statistical methods, aiming at producing token
    sequences with the same statistical properties as
    the generator grammar G0
  • 1. recurrent neural networks
  • 2. Markov models
  • 3. hidden Markov models

22
Grammar emulation, training
(diagram: a context window slides over the token sequence ABGBABGACTVYAB <x> . . . ;
the grammar emulator must predict the next token x)
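
A sketch (ours) of how such training pairs can be generated: slide a fixed-width context window over the token sequence and pair it with the token that has to be predicted:

def training_pairs(seq, width=5):
    """Yield (context window, next token) pairs, e.g. ('ABGBA', 'B'), ('BGBAB', 'G'), ..."""
    for i in range(len(seq) - width):
        yield seq[i:i + width], seq[i + width]

for context, target in training_pairs("ABGBABGACTVYAB"):
    print(context, "->", target)
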
23
Recurrent neural networks for grammar emulation
  • Major types
  • Jordan (output-layer recurrence)
  • Elman (hidden-layer recurrence)

24
Jordan MLPs
  • Assumption: the current state is represented by
    the output unit activation at the previous time
    step(s) and by the current input

(network diagram: input and state units feed the output; the output activation is fed back
into the state units with a one-step time delay, Δt)
25
Elman MLPs
  • Assumption: the current state is represented by
    the hidden unit activation at the previous time
    step(s) and by the current input

(network diagram: input and state units feed the output; the hidden activation is fed back
into the state units with a one-step time delay, Δt)
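
A minimal numpy sketch (ours) of one Elman step: the new hidden state depends on the current input and on the previous hidden state, whereas a Jordan network would feed back the previous output instead. Sizes and weights are arbitrary and no training is shown:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def elman_step(x, h_prev, W_in, W_rec, W_out):
    """Hidden-layer recurrence: h is computed from the current input and the previous hidden state."""
    h = np.tanh(W_in @ x + W_rec @ h_prev)
    y = softmax(W_out @ h)               # distribution over the next token
    return y, h

n_tokens, n_hidden = 4, 8
rng = np.random.default_rng(0)
W_in, W_rec, W_out = (rng.normal(scale=0.1, size=s)
                      for s in [(n_hidden, n_tokens), (n_hidden, n_hidden), (n_tokens, n_hidden)])
h = np.zeros(n_hidden)
for tok in [0, 1, 0, 2]:                 # a toy token sequence (indices)
    y, h = elman_step(np.eye(n_tokens)[tok], h, W_in, W_rec, W_out)
print(y)                                 # predicted distribution over the next token
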
26
Markov variants
  • Shannon: a fixed 5-letter window for English to
    predict the next letter
  • Variable-length Markov Models (VLMM)
    (Guyon & Pereira)
  • idea: the width of the context window used
    to predict the next token in a sequence
    is variable and depends on statistics
    (see the sketch below)
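
A toy sketch (ours, not Guyon and Pereira's actual algorithm): counts are kept for every context width up to a maximum, and prediction backs off from the longest context that has actually been seen, which is the variable-width idea in its simplest form:

from collections import Counter, defaultdict

def train(text, max_width=5):
    """Count which letter follows every context of width 1..max_width (Shannon-style counts)."""
    counts = defaultdict(Counter)
    for w in range(1, max_width + 1):
        for i in range(len(text) - w):
            counts[text[i:i + w]][text[i + w]] += 1
    return counts

def predict(counts, context, max_width=5):
    """Back off from the longest seen context to shorter ones (variable-length context)."""
    for w in range(min(max_width, len(context)), 0, -1):
        followers = counts.get(context[-w:])
        if followers:
            return followers.most_common(1)[0][0]
    return None

model = train("the cat sat on the mat. the cat ate. ")
print(predict(model, "the c"))   # 'a': only 'a' ever follows 'the c' in this toy text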

27
Results
  • Example output of letter VLMM, trained on news
    item texts (250 MB training set)
  • liferator member of flight since N. a report
    the managical including from C all N months after
    dispute. C and declaracter leaders first to do a
    lot of though a ground out and C C pairs due to
    each planner of the lux said the C nailed by the
    defender begin about in N. the spokesman
    standards of the arms responded victory the side
    honored by the accustomers was arrest two
    mentalisting the romatory accustomers of ethnic C
    C the procedure.