Title: Grammatical inference Vs Grammar induction
1Grammatical inference Vs Grammar induction
London 21-22 June 2007
Colin de la Higuera
2Summary
- Why study the algorithms and not the grammars
- Learning in the exact setting
- Learning in a probabilistic setting
31 Why study the process and not the result?
- The usual approach in grammatical inference is to
build a grammar (or automaton), small and adapted in
some way to the data from which we are supposed
to learn.
4Grammatical inference
- Is about learning a grammar given information
about a language.
5Grammar induction
- Is about learning a grammar given information
about a language.
6Difference?
[Diagram: Data → G]
7Motivating example 1
- Is 17 a random number?
- Is 17 more random than 25?
- Suppose I had a random number generator: would I
convince you by showing how well it does on an
example? On various examples?
(and only slightly provocative)
8Motivating example 2
- Is 01101101101101010110001111 a random sequence?
- What about aaabaaabababaabbba?
9Motivating example 3
- Let X be a sample of strings. Is grammar G the
correct grammar for sample X?
- Or is it G′?
- "Correct" meaning something like "the one we should
learn".
10Back to the definition
- Grammar induction and grammatical inference are
about finding a/the grammar from some information
about the language.
- But once we have done that, what can we say?
11What would we like to say?
- That the grammar is the smallest, or the best (w.r.t.
some score): a combinatorial characterisation.
- What we really want to say is that, having solved
some complex combinatorial question, we have an
Occam / compression / MDL / Kolmogorov-like argument
proving that what we have found is of interest.
12What else might we like to say?
- That in the near future, given some string, we
can predict if this string belongs to the
language or not.
- It would be nice to be able to bet 100 on this.
13What else would we like to say?
- That if the solution we have returned is not
good, then that is because the initial data was
bad (insufficient, biased).
- Idea: blame the data, not the algorithm.
14Suppose we cannot say anything of the sort?
- Then that means that we may be terribly wrong
even in a favourable setting.
15Motivating example 4
- Suppose we have an algorithm that learns a
grammar by applying iteratively the following two
operations:
- Merge two non-terminals whenever some nice
MDL-like rule holds
- Add a new non-terminal and a rule corresponding to
a substring when needed
16Two learning operators
- Creation of non-terminals and rules:
NP → ART ADJ NOUN, NP → ART ADJ ADJ NOUN
becomes
NP → ART AP1, NP → ART ADJ AP1, AP1 → ADJ NOUN
17- Merging two non-terminals:
NP → ART AP1, NP → ART AP2, AP1 → ADJ NOUN, AP2 → ADJ AP1
becomes
NP → ART AP1, AP1 → ADJ NOUN, AP1 → ADJ AP1
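A minimal sketch of these two operators, not taken from the talk: rules are stored as (left-hand side, right-hand side) pairs, and the function names create_nonterminal and merge_nonterminals are invented for the illustration.

```python
# Hypothetical illustration of the two operators on slides 16-17.
# A rule is a pair (lhs, rhs), e.g. ("NP", ("ART", "ADJ", "NOUN")).

def create_nonterminal(rules, substring, new_nt):
    """Introduce new_nt -> substring and rewrite every occurrence of
    the substring inside existing right-hand sides."""
    new_rules = {(new_nt, substring)}
    k = len(substring)
    for lhs, rhs in rules:
        rhs = list(rhs)
        i = 0
        while i + k <= len(rhs):
            if tuple(rhs[i:i + k]) == substring:
                rhs[i:i + k] = [new_nt]
            else:
                i += 1
        new_rules.add((lhs, tuple(rhs)))
    return new_rules

def merge_nonterminals(rules, keep, drop):
    """Replace every occurrence of `drop` by `keep` (the merge of slide 17)."""
    ren = lambda s: keep if s == drop else s
    return {(ren(lhs), tuple(ren(s) for s in rhs)) for lhs, rhs in rules}

rules = {("NP", ("ART", "ADJ", "NOUN")), ("NP", ("ART", "ADJ", "ADJ", "NOUN"))}
rules = create_nonterminal(rules, ("ADJ", "NOUN"), "AP1")
# -> NP -> ART AP1, NP -> ART ADJ AP1, AP1 -> ADJ NOUN
rules = create_nonterminal(rules, ("ADJ", "AP1"), "AP2")
rules = merge_nonterminals(rules, "AP1", "AP2")
# -> NP -> ART AP1, AP1 -> ADJ NOUN, AP1 -> ADJ AP1
print(sorted(rules))
```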
18What is bound to happen?
- We will learn a context-free grammar that can
only generate a regular language.
- Brackets are not found.
- This is a hidden bias.
19But how do we say that a learning algorithm is
good?
- By accepting the existence of a target.
- The question is that of studying the process of
finding this target (or something close to this
target). This is an inference process.
20If you don't believe there is a target?
- Or that the target belongs to another class
- You will have to come up with another bias. For
example, believing that simplicity (e.g. MDL) is
the correct way to handle the question.
21If you are prepared to accept there is a target
but…
- Either the target is known, and then what is the point
of learning?
- Or we don't know it in the practical case (with
this data set), and then it is of no use.
22Then you are doing grammar induction.
23Careful
- Some statements that are dangerous:
- "Algorithm A can learn {a^n b^n c^n : n ∈ ℕ}"
- "Algorithm B can learn this rule with just 2
examples"
- Looks to me close to wanting a free lunch.
24A compromise
- You only need to believe there is a target while
evaluating the algorithm.
- Then, in practice, there may not be one!
25End of provocative example
- If I run my random number generator and get
999999, I can only keep this number if I believe
in the generator itself.
26Credo (1)
- Grammatical inference is about measuring the
convergence of a grammar learning algorithm in a
typical situation.
27Credo (2)
- "Typical" can be:
- In the limit: learning is always achieved, one
day
- Probabilistic:
- There is a distribution to be used (errors are
measurably small)
- There is a distribution to be found
28Credo(3)
- Complexity theory should be used: the total or
update runtime, the size of the data needed, the
number of mind changes, the number and weight of
errors should be measured and limited.
292 Non probabilistic setting
- Identification in the limit
- Resource bounded identification in the limit
- Active learning (query learning)
30Identification in the limit
- The definitions, presentations
- The alternatives
- Order free or not
- Randomised algorithm
31A presentation is
- a function f : ℕ → X
- where X is any set,
- yields : Presentations → Languages
- If f(ℕ) = g(ℕ) then yields(f) = yields(g)
32Some presentations (1)
- A text presentation of a language L ⊆ Σ* is a
function f : ℕ → Σ* such that f(ℕ) = L.
- f is an infinite succession of all the elements
of L.
- (note: small technical difficulty with the empty language)
33Some presentations (2)
- An informed presentation (or an informant) of
L ⊆ Σ* is a function f : ℕ → Σ* × {+, −} such that
f(ℕ) = (L × {+}) ∪ ((Σ* \ L) × {−}).
- f is an infinite succession of all the elements
of Σ*, labelled to indicate whether or not they belong
to L.
34Learning function
- Given a presentation f, f_n is the set of the
first n elements in f.
- A learning algorithm a is a function that takes
as input a set f_n = {f(0), …, f(n−1)} and returns a
grammar.
- Given a grammar G, L(G) is the language
generated/recognised/represented by G.
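A toy illustration of slides 31-34, under invented assumptions: the target language is a*, the text presentation is f(i) = a^i, and the learner is a naive one that simply returns a grammar for the finite sample it has seen (none of this comes from the talk).

```python
from itertools import islice

def text_presentation():
    """f : N -> Sigma*, enumerating L = {a^n : n >= 0} (a text for L)."""
    n = 0
    while True:
        yield "a" * n
        n += 1

def f_n(f, n):
    """The set of the first n elements of the presentation f."""
    return set(islice(f(), n))

def learner(sample):
    """A naive learning algorithm: a grammar whose language is the sample itself."""
    return {("S", (w,)) for w in sample}   # rules S -> w

fn = f_n(text_presentation, 5)             # {'', 'a', 'aa', 'aaa', 'aaaa'}
print(sorted(fn, key=len))
print(sorted(learner(fn)))
```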
35Identification in the limit
[Diagram: a class of languages L, a class of grammars G, presentations Pres ⊆ (ℕ → X), the naming function yields : Pres → L, and a learner a from presentations to G.]
- f(ℕ) = g(ℕ) ⇒ yields(f) = yields(g)
- ∃n ∈ ℕ such that ∀k > n, L(a(f_k)) = yields(f)
36What about efficiency?
- We can try to bound
- global time
- update time
- errors before converging
- mind changes
- queries
- good examples needed
37What should we try to measure?
- The size of G ?
- The size of L ?
- The size of f ?
- The size of fn ?
38Some candidates for polynomial learning
- Total runtime polynomial in ‖L‖
- Update runtime polynomial in ‖L‖
- Mind changes polynomial in ‖L‖
- Implicit prediction errors polynomial in ‖L‖
- Size of characteristic sample polynomial in ‖L‖
39[Diagram: from each prefix f_1, f_2, …, f_n, …, f_k of the presentation f(0), f(1), …, f(n−1), …, f(k), … the learner a produces hypotheses G_1, G_2, …, G_n, …]
40Some selected results (1)
41Some selected results (2)
42Some selected results (3)
433 Probabilistic setting
- Using the distribution to measure error
- Identifying the distribution
- Approximating the distribution
44Probabilistic settings
- PAC learning
- Identification with probability 1
- PAC learning distributions
45Learning a language from sampling
- We have a distribution over Σ*
- We sample twice
- Once to learn
- Once to see how well we have learned
- The PAC setting
- Probably approximately correct
46PAC learning(Valiant 84, Pitt 89)
- L a set of languages
- G a set of grammars
- ε > 0 and δ > 0
- m a maximal length over the strings
- n a maximal size of grammars
47Polynomially PAC learnable
- There is an algorithm that samples reasonably and
returns, with probability at least 1 − δ, a grammar
that will make at most ε errors.
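A sketch of the "sample once to learn, once to evaluate" protocol of slides 45-47; the distribution, target language and learner below are all placeholders invented for the example, not an actual PAC learner for grammars.

```python
import random

random.seed(0)

def sample_string(max_len=8):
    """Placeholder distribution D over strings of {a, b} of length <= max_len."""
    n = random.randint(0, max_len)
    return "".join(random.choice("ab") for _ in range(n))

def in_target(w):
    """Placeholder target language: strings with an even number of a's."""
    return w.count("a") % 2 == 0

def learn(labelled):
    """Placeholder learner: memorises the positive examples seen so far."""
    positives = {w for w, label in labelled if label}
    return lambda w: w in positives

# Sample once to learn ...
train = [(w, in_target(w)) for w in (sample_string() for _ in range(200))]
hypothesis = learn(train)

# ... and once more to see how well we have learned:
test = [sample_string() for _ in range(1000)]
error = sum(hypothesis(w) != in_target(w) for w in test) / len(test)
print(f"estimated Pr_D[H(x) != G(x)] = {error:.3f}")
```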
48Results
- Using cryptographic assumptions, we cannot PAC
learn DFA.
- Cannot PAC learn NFA, CFGs with membership
queries either.
49Learning distributions
50No error
- This calls for identification in the limit with
probability 1.
- Means that the probability of not converging is 0.
51Results
- If probabilities are computable, we can learn
finite state automata with probability 1.
- But not with bounded (polynomial) resources.
52With error
- PAC definition
- But error should be measured by a distance
between the target distribution and the
hypothesis:
- L1, L2, L∞?
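For concreteness, a small sketch of these three distances over distributions represented as string-to-probability dictionaries; the two toy distributions below are invented.

```python
import math

def l1(d1, d2):
    keys = set(d1) | set(d2)
    return sum(abs(d1.get(w, 0.0) - d2.get(w, 0.0)) for w in keys)

def l2(d1, d2):
    keys = set(d1) | set(d2)
    return math.sqrt(sum((d1.get(w, 0.0) - d2.get(w, 0.0)) ** 2 for w in keys))

def linf(d1, d2):
    keys = set(d1) | set(d2)
    return max(abs(d1.get(w, 0.0) - d2.get(w, 0.0)) for w in keys)

# Two toy distributions over a handful of strings:
target     = {"": 0.5, "a": 0.25, "ab": 0.125, "abb": 0.125}
hypothesis = {"": 0.5, "a": 0.20, "ab": 0.150, "abb": 0.150}
print(l1(target, hypothesis), l2(target, hypothesis), linf(target, hypothesis))
```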
53Results
- Too easy with L∞
- Too hard with L1
- Nice algorithms for biased classes of
distributions.
54For those that are not convinced there is a
difference
55Structural completeness
- Given a sample and a DFA
- each edge is used at least once; each final state
accepts at least one string
- Look only at DFA for which the sample is
structurally complete!
56[Diagram: a DFA over {a, b}]
- X = {aab, b, aaaba, bbaba} is not structurally complete; add the empty string and abb.
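A sketch of the structural-completeness test of slide 55 for a DFA given as a transition table; the two-state automaton and the samples below are invented, not the ones of the slide.

```python
def run(dfa, w):
    """Return (transitions used, final state) on reading w, or None if w is rejected early."""
    state, used = dfa["start"], []
    for c in w:
        if (state, c) not in dfa["delta"]:
            return None
        used.append((state, c))
        state = dfa["delta"][(state, c)]
    return used, state

def structurally_complete(dfa, sample):
    """Every edge is used at least once and every final state accepts
    at least one string of the sample."""
    used_edges, accepting_finals = set(), set()
    for w in sample:
        res = run(dfa, w)
        if res is None:
            continue
        used, state = res
        used_edges.update(used)
        if state in dfa["finals"]:
            accepting_finals.add(state)
    return used_edges == set(dfa["delta"]) and accepting_finals == set(dfa["finals"])

# Invented two-state DFA accepting strings with an even number of a's:
dfa = {"start": 0, "finals": {0},
       "delta": {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}}
print(structurally_complete(dfa, {"ab"}))          # False
print(structurally_complete(dfa, {"abba", "ba"}))  # True
```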
57Question
- Why is the automaton structurally complete for
the sample?
- And not the sample structurally complete for the
automaton?
58Some of the many things I have not talked about
- Grammatical inference is about new algorithms
- Grammatical inference is applied to various
fields: pattern recognition, machine translation,
computational biology, NLP, software engineering,
web mining, robotics
59And
- Next ICGI in Brittany in 2008
- Some references in the 1-page abstract, others on
the grammatical inference webpage.
60Appendix, some technicalities
- Size of L
- Size of f
- Size of G
- MC (mind changes)
- Runtimes
- IPE (implicit prediction errors)
- PAC
- CS (characteristic samples)
61The size of G: ‖G‖
- The size of a grammar is the number of bits
needed to encode the grammar.
- Better: some value polynomial in the desired
quantity.
- Examples:
- DFA: number of states
- CFG: number of rules × length of the rules
62The size of L
- If no grammar system is given, meaningless.
- If G is the class of grammars, then
‖L‖ = min{‖G‖ : G ∈ G, L(G) = L}.
- Example: the size of a regular language when
considering DFA is the number of states of the
minimal DFA that recognizes it.
63Is a grammar representation reasonable?
- Difficult question: typical arguments are that
NFA are better than DFA because you can encode
more languages with fewer bits.
- Yet redundancy is necessary!
64Proposal
- A grammar class is reasonable if it encodes
sufficiently many different languages.
- I.e. with n bits you have 2^(n+1) encodings, so
optimally you should have 2^(n+1) different
languages.
- Allow for redundancy and syntactic sugar, so
p(2^(n+1)) different languages.
65But
- We should allow for redundancy and for some
strings that do not encode grammars.
- Therefore a grammar representation is reasonable
if there exists a polynomial p() such that for any n
the number of different languages encoded by
grammars of size n is at least p(2^n).
66The size of a presentation f
- Meaningless. Or at least no convincing definition
comes up.
- But when associated with a learner a we can
define the convergence point Cp(f, a), which is the
point at which the learner a finds a grammar for
the correct language L and does not change its
mind:
- Cp(f, a) = n such that ∀m ≥ n, a(f_m) = a(f_n) and L(a(f_n)) = L
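A sketch of Cp(f, a) evaluated on a finite horizon only (identification in the limit cannot, of course, be verified on finite data; this just illustrates the formula). The presentation, the learner and the correctness test below are toy placeholders invented for the example.

```python
# Toy setting: target language L = a*, text presentation f(i) = a^i, and a
# made-up learner that conjectures "a*" once it has seen three strings.

def f(i):
    return "a" * i

def learner(sample):
    return "a*" if len(sample) >= 3 else frozenset(sample)

def correct(hypothesis):
    return hypothesis == "a*"

def convergence_point(f, learner, correct, horizon=20):
    """First n such that the hypothesis on f_n is correct and never changes
    up to the horizon (the real Cp quantifies over all m >= n)."""
    hyps = [learner({f(i) for i in range(n)}) for n in range(1, horizon + 1)]
    for n, h in enumerate(hyps, start=1):
        if correct(h) and all(later == h for later in hyps[n - 1:]):
            return n
    return None

print(convergence_point(f, learner, correct))   # 3 with this toy learner
```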
67The size of a finite presentation fn
- An easy attempt is n.
- But then this does not represent the quantity of
information we have received to learn.
- A better measure is Σ_{i<n} |f(i)|.
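For instance, on the first five elements of the toy text presentation of a* used above, the two measures differ:

```python
f_n = ["", "a", "aa", "aaa", "aaaa"]        # first 5 elements of a toy presentation
print(len(f_n), sum(len(w) for w in f_n))   # n = 5 versus sum of lengths = 10
```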
68Quantities associated with learner a
- The update runtime: time needed to update
hypothesis h_{n−1} into h_n when presented with f(n).
- The complete runtime: time needed to build
hypothesis h_n from f_n. Also the sum of all
update runtimes.
69Definition 1 (total time)
- G is polynomially identifiable in the limit from
Pres if there exists an identification algorithm
a and a polynomial p() such that, given any G in
G and given any presentation f such that
yields(f) = L(G), Cp(f, a) ≤ p(‖G‖).
- (or global-runtime(a) ≤ p(‖G‖))
70Impossible
- Just take some presentation that stays useless
until the bound is reached and then starts
helping.
71Definition 2 (update polynomial time)
- G is polynomially identifiable in the limit from
Pres if there exists an identification algorithm
a and a polynomial p() such that, given any G in
G and given any presentation f such that
yields(f) = L(G), update-runtime(a) ≤ p(‖G‖).
72Doesn't work
- We can just defer identification.
- Here we are measuring the time it takes to build
the next hypothesis.
73Definition 4 polynomial number of mind changes
- G is polynomially identifiable in the limit from
Pres if there exists an identification algorithm
a and a polynomial p() such that, given any G in
G and given any presentation f such that
yields(f) = L(G),
- |{i : a(f_i) ≠ a(f_{i+1})}| ≤ p(‖G‖).
74Definition 5 polynomial number of implicit
prediction errors
- Denote by G ⊬ x the fact that G is incorrect with
respect to an element x of the presentation (i.e. the
algorithm producing G has made an implicit
prediction error).
75- G is polynomially identifiable in the limit from
Pres if there exists an identification algorithm
a and a polynomial p() such that, given any G in
G and given any presentation f such that
yields(f) = L(G),
- |{i : a(f_i) ⊬ f(i+1)}| ≤ p(‖G‖).
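A sketch counting both quantities (mind changes, Definition 4, and implicit prediction errors, Definition 5) along a prefix of an informant; the presentation, the naive learner and the disagreement test below are invented for the example.

```python
def mind_changes(hypotheses):
    """|{ i : a(f_i) != a(f_{i+1}) }|  (Definition 4)."""
    return sum(h1 != h2 for h1, h2 in zip(hypotheses, hypotheses[1:]))

def implicit_prediction_errors(hypotheses, presentation, disagrees):
    """|{ i : a(f_i) is wrong about f(i+1) }|  (Definition 5);
    disagrees(h, x) plays the role of the test G |-/- x and is problem-specific."""
    return sum(disagrees(h, presentation[i + 1])
               for i, h in enumerate(hypotheses[:-1]))

# Toy run: an informant for "strings over {a, b} with no b", and a learner
# that conjectures the set of positive strings seen so far.
presentation = [("", True), ("b", False), ("a", True), ("ab", False), ("aa", True)]
hyps, positives = [], set()
for w, label in presentation:
    if label:
        positives.add(w)
    hyps.append(frozenset(positives))

disagrees = lambda h, x: (x[0] in h) != x[1]   # hypothesis membership vs. label
print(mind_changes(hyps), implicit_prediction_errors(hyps, presentation, disagrees))
# -> 2 mind changes, 2 implicit prediction errors on this toy prefix
```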
76Definition 6 polynomial characteristic sample
- G has polynomial characteristic samples for
identification algorithm a if there exists a
polynomial p() such that, given any G in G, there
exists a correct sample Y for G such that whenever
Y ⊆ f_n, L(a(f_n)) = L(G) and ‖Y‖ ≤ p(‖G‖).
773 Probabilistic setting
- Using the distribution to measure error
- Identifying the distribution
- Approximating the distribution
78Probabilistic settings
- PAC learning
- Identification with probability 1
- PAC learning distributions
79Learning a language from sampling
- We have a distribution over Σ*
- We sample twice
- Once to learn
- Once to see how well we have learned
- The PAC setting
80How do we consider a finite set?
[Diagram: a distribution D over Σ*]
By sampling (1/ε)·ln(1/δ) examples we can find a
safe m.
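A quick numerical check of the (1/ε)·ln(1/δ) bound on this slide, with arbitrarily chosen values of ε and δ:

```python
import math

epsilon, delta = 0.1, 0.05
n = math.ceil((1 / epsilon) * math.log(1 / delta))
print(n)   # 30 examples suffice for these values
```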
81PAC learning(Valiant 84, Pitt 89)
- L a set of languages
- G a set of grammars
- ε > 0 and δ > 0
- m a maximal length over the strings
- n a maximal size of grammars
82- H is ε-AC (approximately correct) if
- Pr_D(H(x) ≠ G(x)) < ε
83[Diagram: L(G) and L(H); the symmetric difference is the set of errors]
- Errors: we want L1(D(G), D(H)) < ε
85[Diagram: a probabilistic automaton over {a, b}; what is Pr(abab)?]
86[Diagram: the same automaton with transition and stopping probabilities (0.1, 0.9, 0.35, 0.65, 0.7, 0.3), from which Pr(abab) can be computed]
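A sketch of how Pr(abab) is computed in a deterministic probabilistic automaton: multiply the transition probabilities along the path of abab, then the stopping probability of the state reached. The two-state PFA below is invented, not the one of the figure.

```python
def string_probability(pfa, w):
    """Probability of w in a deterministic PFA: product of the transition
    probabilities along the path, times the stopping probability at the end."""
    state, prob = pfa["start"], 1.0
    for c in w:
        nxt, p = pfa["delta"][(state, c)]
        prob *= p
        state = nxt
    return prob * pfa["final"][state]

# Invented 2-state PFA over {a, b}; in each state the outgoing probabilities
# plus the stopping probability sum to 1.
pfa = {
    "start": 0,
    "delta": {(0, "a"): (1, 0.5), (0, "b"): (0, 0.2),
              (1, "a"): (1, 0.1), (1, "b"): (0, 0.6)},
    "final": {0: 0.3, 1: 0.3},
}
print(string_probability(pfa, "abab"))   # 0.5 * 0.6 * 0.5 * 0.6 * 0.3 = 0.027
```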