Title: Grammatical inference Vs Grammar induction
1Grammatical inference Vs Grammar induction
London 21-22 June 2007
Colin de la Higuera
2Summary
- Why study the algorithms and not the grammars
- Learning in the exact setting
- Learning in a probabilistic setting
31 Why study the process and not the result?
- The usual approach in grammatical inference is to
build a grammar (or automaton), small and adapted in
some way to the data from which we are supposed
to learn.
4Grammatical inference
- Is about learning a grammar given information
about a language.
5Grammar induction
- Is about learning a grammar given information
about a language.
6Difference?
[Diagram: Data → G]
7Motivating example 1
- Is 17 a random number?
- Is 17 more random than 25?
- Suppose I had a random number generator: would I
convince you by showing how well it does on an
example? On various examples?
(and only slightly provocative)
8Motivating example 2
- Is 01101101101101010110001111 a random sequence?
- What about aaabaaabababaabbba?
9Motivating example 3
- Let X be a sample of strings. Is grammar G the
correct grammar for sample X?
- Or is it G′?
- "Correct" meaning something like "the one we should
learn".
10Back to the definition
- Grammar induction and grammatical inference are
about finding a/the grammar from some information
about the language.
- But once we have done that, what can we say?
11What would we like to say?
- That the grammar is the smallest, or the best (w.r.t.
some score): a combinatorial characterisation.
- What we really want to say is that, having solved
some complex combinatorial question, we have an
Occam / compression / MDL / Kolmogorov-like argument
proving that what we have found is of interest.
12What else might we like to say?
- That in the near future, given some string, we
can predict if this string belongs to the
language or not.
- It would be nice to be able to bet 100 on this.
13What else would we like to say?
- That if the solution we have returned is not
good, then that is because the initial data was
bad (insufficient, biased).
- Idea: blame the data, not the algorithm.
14Suppose we cannot say anything of the sort?
- Then that means that we may be terribly wrong
even in a favourable setting.
15Motivating example 4
- Suppose we have an algorithm that learns a
grammar by applying iteratively the following two
operations:
- Merge two non-terminals whenever some nice
MDL-like rule holds
- Add a new non-terminal and a rule corresponding to
a substring when needed
16Two learning operators
- Creation of non-terminals and rules:
NP → ART ADJ NOUN, NP → ART ADJ ADJ NOUN
becomes
NP → ART AP1, NP → ART ADJ AP1, AP1 → ADJ NOUN
17- Merging two non-terminals:
NP → ART AP1, NP → ART AP2, AP1 → ADJ NOUN, AP2 → ADJ AP1
becomes
NP → ART AP1, AP1 → ADJ NOUN, AP1 → ADJ AP1
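A minimal sketch of these two operators, not taken from the talk: rules are stored as (left-hand side, right-hand side) pairs, and the function names create_nonterminal and merge_nonterminals are invented for the illustration.

```python
# Hypothetical illustration of the two operators on slides 16-17.
# A rule is a pair (lhs, rhs), e.g. ("NP", ("ART", "ADJ", "NOUN")).

def create_nonterminal(rules, substring, new_nt):
    """Introduce new_nt -> substring and rewrite every occurrence of
    the substring inside existing right-hand sides."""
    new_rules = {(new_nt, substring)}
    k = len(substring)
    for lhs, rhs in rules:
        rhs = list(rhs)
        i = 0
        while i + k <= len(rhs):
            if tuple(rhs[i:i + k]) == substring:
                rhs[i:i + k] = [new_nt]
            else:
                i += 1
        new_rules.add((lhs, tuple(rhs)))
    return new_rules

def merge_nonterminals(rules, keep, drop):
    """Replace every occurrence of `drop` by `keep` (the merge of slide 17)."""
    ren = lambda s: keep if s == drop else s
    return {(ren(lhs), tuple(ren(s) for s in rhs)) for lhs, rhs in rules}

rules = {("NP", ("ART", "ADJ", "NOUN")), ("NP", ("ART", "ADJ", "ADJ", "NOUN"))}
rules = create_nonterminal(rules, ("ADJ", "NOUN"), "AP1")
# -> NP -> ART AP1, NP -> ART ADJ AP1, AP1 -> ADJ NOUN
rules = create_nonterminal(rules, ("ADJ", "AP1"), "AP2")
rules = merge_nonterminals(rules, "AP1", "AP2")
# -> NP -> ART AP1, AP1 -> ADJ NOUN, AP1 -> ADJ AP1
print(sorted(rules))
```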
18What is bound to happen?
- We will learn a context-free grammar that can
only generate a regular language.
- Brackets are not found.
- This is a hidden bias.
19But how do we say that a learning algorithm is
good?
- By accepting the existence of a target.
- The question is that of studying the process of
finding this target (or something close to this
target). This is an inference process.
20If you don't believe there is a target?
- Or that the target belongs to another class
- You will have to come up with another bias. For
example, believing that simplicity (e.g. MDL) is
the correct way to handle the question.
21If you are prepared to accept there is a target
but…
- Either the target is known, and then what is the point
of learning?
- Or we don't know it in the practical case (with
this data set), and then it is of no use.
22Then you are doing grammar induction.
23Careful
- Some statements that are dangerous:
- "Algorithm A can learn {a^n b^n c^n : n ∈ ℕ}"
- "Algorithm B can learn this rule with just 2
examples"
- Looks to me close to wanting a free lunch.
24A compromise
- You only need to believe there is a target while
evaluating the algorithm.
- Then, in practice, there may not be one!
25End of provocative example
- If I run my random number generator and get
999999, I can only keep this number if I believe
in the generator itself.
26Credo (1)
- Grammatical inference is about measuring the
convergence of a grammar learning algorithm in a
typical situation.
27Credo (2)
- "Typical" can be:
- In the limit: learning is always achieved, one
day
- Probabilistic:
- There is a distribution to be used (errors are
measurably small)
- There is a distribution to be found
28Credo(3)
- Complexity theory should be used: the total or
update runtime, the size of the data needed, the
number of mind changes, the number and weight of
errors should be measured and limited.
292 Non probabilistic setting
- Identification in the limit
- Resource bounded identification in the limit
- Active learning (query learning)
30Identification in the limit
- The definitions, presentations
- The alternatives
- Order free or not
- Randomised algorithm
31A presentation is
- a function f : ℕ → X
- where X is any set,
- yields : Presentations → Languages
- If f(ℕ) = g(ℕ) then yields(f) = yields(g)
32Some presentations (1)
- A text presentation of a language L ⊆ Σ* is a
function f : ℕ → Σ* such that f(ℕ) = L.
- f is an infinite succession of all the elements
of L.
- (note: small technical difficulty with the empty language)
33Some presentations (2)
- An informed presentation (or an informant) of
L ⊆ Σ* is a function f : ℕ → Σ* × {+, −} such that
f(ℕ) = (L × {+}) ∪ ((Σ* \ L) × {−}).
- f is an infinite succession of all the elements
of Σ*, labelled to indicate whether or not they belong
to L.
34Learning function
- Given a presentation f, f_n is the set of the
first n elements in f.
- A learning algorithm a is a function that takes
as input a set f_n = {f(0), …, f(n−1)} and returns a
grammar.
- Given a grammar G, L(G) is the language
generated/recognised/represented by G.
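A toy illustration of slides 31-34, under invented assumptions: the target language is a*, the text presentation is f(i) = a^i, and the learner is a naive one that simply returns a grammar for the finite sample it has seen (none of this comes from the talk).

```python
from itertools import islice

def text_presentation():
    """f : N -> Sigma*, enumerating L = {a^n : n >= 0} (a text for L)."""
    n = 0
    while True:
        yield "a" * n
        n += 1

def f_n(f, n):
    """The set of the first n elements of the presentation f."""
    return set(islice(f(), n))

def learner(sample):
    """A naive learning algorithm: a grammar whose language is the sample itself."""
    return {("S", (w,)) for w in sample}   # rules S -> w

fn = f_n(text_presentation, 5)             # {'', 'a', 'aa', 'aaa', 'aaaa'}
print(sorted(fn, key=len))
print(sorted(learner(fn)))
```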
35Identification in the limit
[Diagram: a class of languages L, a class of grammars G, presentations Pres ⊆ (ℕ → X), the naming function yields : Pres → L, and a learner a from presentations to G.]
- f(ℕ) = g(ℕ) ⇒ yields(f) = yields(g)
- ∃n ∈ ℕ such that ∀k > n, L(a(f_k)) = yields(f)
36What about efficiency?
- We can try to bound
- global time
- update time
- errors before converging
- mind changes
- queries
- good examples needed
37What should we try to measure?
- The size of G ?
- The size of L ?
- The size of f ?
- The size of fn ?
38Some candidates for polynomial learning
- Total runtime polynomial in ‖L‖
- Update runtime polynomial in ‖L‖
- Mind changes polynomial in ‖L‖
- Implicit prediction errors polynomial in ‖L‖
- Size of characteristic sample polynomial in ‖L‖
39[Diagram: from each prefix f_1, f_2, …, f_n, …, f_k of the presentation f(0), f(1), …, f(n−1), …, f(k), … the learner a produces hypotheses G_1, G_2, …, G_n, …]
40Some selected results (1)
41Some selected results (2)
42Some selected results (3)
433 Probabilistic setting
- Using the distribution to measure error
- Identifying the distribution
- Approximating the distribution
44Probabilistic settings
- PAC learning
- Identification with probability 1
- PAC learning distributions
45Learning a language from sampling
- We have a distribution over Σ*
- We sample twice
- Once to learn
- Once to see how well we have learned
- The PAC setting
- Probably approximately correct
46PAC learning(Valiant 84, Pitt 89)
- L a set of languages
- G a set of grammars
- ε > 0 and δ > 0
- m a maximal length over the strings
- n a maximal size of grammars
47Polynomially PAC learnable
- There is an algorithm that samples reasonably and
returns, with probability at least 1 − δ, a grammar
that will make at most ε errors.
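A sketch of the "sample once to learn, once to evaluate" protocol of slides 45-47; the distribution, target language and learner below are all placeholders invented for the example, not an actual PAC learner for grammars.

```python
import random

random.seed(0)

def sample_string(max_len=8):
    """Placeholder distribution D over strings of {a, b} of length <= max_len."""
    n = random.randint(0, max_len)
    return "".join(random.choice("ab") for _ in range(n))

def in_target(w):
    """Placeholder target language: strings with an even number of a's."""
    return w.count("a") % 2 == 0

def learn(labelled):
    """Placeholder learner: memorises the positive examples seen so far."""
    positives = {w for w, label in labelled if label}
    return lambda w: w in positives

# Sample once to learn ...
train = [(w, in_target(w)) for w in (sample_string() for _ in range(200))]
hypothesis = learn(train)

# ... and once more to see how well we have learned:
test = [sample_string() for _ in range(1000)]
error = sum(hypothesis(w) != in_target(w) for w in test) / len(test)
print(f"estimated Pr_D[H(x) != G(x)] = {error:.3f}")
```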
48Results
- Using cryptographic assumptions, we cannot PAC
learn DFA.
- Cannot PAC learn NFA, CFGs with membership
queries either.
49Learning distributions
50No error
- This calls for identification in the limit with
probability 1.
- Means that the probability of not converging is 0.
51Results
- If probabilities are computable, we can learn
finite state automata with probability 1.
- But not with bounded (polynomial) resources.
52With error
- PAC definition
- But error should be measured by a distance
between the target distribution and the
hypothesis:
- L1, L2, L∞?
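For concreteness, a small sketch of these three distances over distributions represented as string-to-probability dictionaries; the two toy distributions below are invented.

```python
import math

def l1(d1, d2):
    keys = set(d1) | set(d2)
    return sum(abs(d1.get(w, 0.0) - d2.get(w, 0.0)) for w in keys)

def l2(d1, d2):
    keys = set(d1) | set(d2)
    return math.sqrt(sum((d1.get(w, 0.0) - d2.get(w, 0.0)) ** 2 for w in keys))

def linf(d1, d2):
    keys = set(d1) | set(d2)
    return max(abs(d1.get(w, 0.0) - d2.get(w, 0.0)) for w in keys)

# Two toy distributions over a handful of strings:
target     = {"": 0.5, "a": 0.25, "ab": 0.125, "abb": 0.125}
hypothesis = {"": 0.5, "a": 0.20, "ab": 0.150, "abb": 0.150}
print(l1(target, hypothesis), l2(target, hypothesis), linf(target, hypothesis))
```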
53Results
- Too easy with L∞
- Too hard with L1
- Nice algorithms for biased classes of
distributions.
54For those that are not convinced there is a
difference
55Structural completeness
- Given a sample and a DFA
- each edge is used at least once; each final state
accepts at least one string
- Look only at DFA for which the sample is
structurally complete!
56[Diagram: a DFA over {a, b}]
- X = {aab, b, aaaba, bbaba} is not structurally complete; add the empty string and abb.
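A sketch of the structural-completeness test of slide 55 for a DFA given as a transition table; the two-state automaton and the samples below are invented, not the ones of the slide.

```python
def run(dfa, w):
    """Return (transitions used, final state) on reading w, or None if w is rejected early."""
    state, used = dfa["start"], []
    for c in w:
        if (state, c) not in dfa["delta"]:
            return None
        used.append((state, c))
        state = dfa["delta"][(state, c)]
    return used, state

def structurally_complete(dfa, sample):
    """Every edge is used at least once and every final state accepts
    at least one string of the sample."""
    used_edges, accepting_finals = set(), set()
    for w in sample:
        res = run(dfa, w)
        if res is None:
            continue
        used, state = res
        used_edges.update(used)
        if state in dfa["finals"]:
            accepting_finals.add(state)
    return used_edges == set(dfa["delta"]) and accepting_finals == set(dfa["finals"])

# Invented two-state DFA accepting strings with an even number of a's:
dfa = {"start": 0, "finals": {0},
       "delta": {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}}
print(structurally_complete(dfa, {"ab"}))          # False
print(structurally_complete(dfa, {"abba", "ba"}))  # True
```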
57Question
- Why is the automaton structurally complete for
the sample?
- And not the sample structurally complete for the
automaton?
58Some of the many things I have not talked about
- Grammatical inference is about new algorithms
- Grammatical inference is applied to various
fields: pattern recognition, machine translation,
computational biology, NLP, software engineering,
web mining, robotics
59And
- Next ICGI in Brittany in 2008
- Some references in the 1-page abstract, others on
the grammatical inference webpage.
60Appendix, some technicalities
- Size of L
- Size of f
- Size of G
- MC (mind changes)
- Runtimes
- IPE (implicit prediction errors)
- PAC
- CS (characteristic samples)
61The size of G: ‖G‖
- The size of a grammar is the number of bits
needed to encode the grammar.
- Better: some value polynomial in the desired
quantity.
- Examples:
- DFA: number of states
- CFG: number of rules × length of the rules
62The size of L
- If no grammar system is given, meaningless.
- If G is the class of grammars, then
‖L‖ = min{‖G‖ : G ∈ G, L(G) = L}.
- Example: the size of a regular language when
considering DFA is the number of states of the
minimal DFA that recognizes it.
63Is a grammar representation reasonable?
- Difficult question: typical arguments are that
NFA are better than DFA because you can encode
more languages with fewer bits.
- Yet redundancy is necessary!
64Proposal
- A grammar class is reasonable if it encodes
sufficiently many different languages.
- I.e. with n bits you have 2^(n+1) encodings, so
optimally you should have 2^(n+1) different
languages.
- Allow for redundancy and syntactic sugar, so
p(2^(n+1)) different languages.
65But
- We should allow for redundancy and for some
strings that do not encode grammars.
- Therefore a grammar representation is reasonable
if there exists a polynomial p() such that for any n
the number of different languages encoded by
grammars of size n is at least p(2^n).
66The size of a presentation f
- Meaningless. Or at least no convincing definition
comes up.
- But when associated with a learner a we can
define the convergence point Cp(f, a), which is the
point at which the learner a finds a grammar for
the correct language L and does not change its
mind:
- Cp(f, a) = n such that ∀m ≥ n, a(f_m) = a(f_n) and L(a(f_n)) = L
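A sketch of Cp(f, a) evaluated on a finite horizon only (identification in the limit cannot, of course, be verified on finite data; this just illustrates the formula). The presentation, the learner and the correctness test below are toy placeholders invented for the example.

```python
# Toy setting: target language L = a*, text presentation f(i) = a^i, and a
# made-up learner that conjectures "a*" once it has seen three strings.

def f(i):
    return "a" * i

def learner(sample):
    return "a*" if len(sample) >= 3 else frozenset(sample)

def correct(hypothesis):
    return hypothesis == "a*"

def convergence_point(f, learner, correct, horizon=20):
    """First n such that the hypothesis on f_n is correct and never changes
    up to the horizon (the real Cp quantifies over all m >= n)."""
    hyps = [learner({f(i) for i in range(n)}) for n in range(1, horizon + 1)]
    for n, h in enumerate(hyps, start=1):
        if correct(h) and all(later == h for later in hyps[n - 1:]):
            return n
    return None

print(convergence_point(f, learner, correct))   # 3 with this toy learner
```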
67The size of a finite presentation fn
- An easy attempt is n.
- But then this does not represent the quantity of
information we have received to learn.
- A better measure is Σ_{i<n} |f(i)|.
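For instance, on the first five elements of the toy text presentation of a* used above, the two measures differ:

```python
f_n = ["", "a", "aa", "aaa", "aaaa"]        # first 5 elements of a toy presentation
print(len(f_n), sum(len(w) for w in f_n))   # n = 5 versus sum of lengths = 10
```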
68Quantities associated with learner a
- The update runtime: time needed to update
hypothesis h_{n−1} into h_n when presented with f(n).
- The complete runtime: time needed to build
hypothesis h_n from f_n. Also the sum of all
update runtimes.
69Definition 1 (total time)
- G is polynomially identifiable in the limit from
Pres if there exists an identification algorithm
a and a polynomial p() such that, given any G in
G and given any presentation f such that
yields(f) = L(G), Cp(f, a) ≤ p(‖G‖).
- (or global-runtime(a) ≤ p(‖G‖))
70Impossible
- Just take some presentation that stays useless
until the bound is reached and then starts
helping.
71Definition 2 (update polynomial time)
- G is polynomially identifiable in the limit from
Pres if there exists an identification algorithm
a and a polynomial p() such that, given any G in
G and given any presentation f such that
yields(f) = L(G), update-runtime(a) ≤ p(‖G‖).
72Doesn't work
- We can just defer identification.
- Here we are measuring the time it takes to build
the next hypothesis.
73Definition 4 polynomial number of mind changes
- G is polynomially identifiable in the limit from
Pres if there exists an identification algorithm
a and a polynomial p() such that, given any G in
G and given any presentation f such that
yields(f) = L(G),
- |{i : a(f_i) ≠ a(f_{i+1})}| ≤ p(‖G‖).
74Definition 5 polynomial number of implicit
prediction errors
- Denote by G ⊬ x the fact that G is incorrect with
respect to an element x of the presentation (i.e. the
algorithm producing G has made an implicit
prediction error).
75- G is polynomially identifiable in the limit from
Pres if there exists an identification algorithm
a and a polynomial p() such that, given any G in
G and given any presentation f such that
yields(f) = L(G),
- |{i : a(f_i) ⊬ f(i+1)}| ≤ p(‖G‖).
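A sketch counting both quantities (mind changes, Definition 4, and implicit prediction errors, Definition 5) along a prefix of an informant; the presentation, the naive learner and the disagreement test below are invented for the example.

```python
def mind_changes(hypotheses):
    """|{ i : a(f_i) != a(f_{i+1}) }|  (Definition 4)."""
    return sum(h1 != h2 for h1, h2 in zip(hypotheses, hypotheses[1:]))

def implicit_prediction_errors(hypotheses, presentation, disagrees):
    """|{ i : a(f_i) is wrong about f(i+1) }|  (Definition 5);
    disagrees(h, x) plays the role of the test G |-/- x and is problem-specific."""
    return sum(disagrees(h, presentation[i + 1])
               for i, h in enumerate(hypotheses[:-1]))

# Toy run: an informant for "strings over {a, b} with no b", and a learner
# that conjectures the set of positive strings seen so far.
presentation = [("", True), ("b", False), ("a", True), ("ab", False), ("aa", True)]
hyps, positives = [], set()
for w, label in presentation:
    if label:
        positives.add(w)
    hyps.append(frozenset(positives))

disagrees = lambda h, x: (x[0] in h) != x[1]   # hypothesis membership vs. label
print(mind_changes(hyps), implicit_prediction_errors(hyps, presentation, disagrees))
# -> 2 mind changes, 2 implicit prediction errors on this toy prefix
```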
76Definition 6 polynomial characteristic sample
- G has polynomial characteristic samples for
identification algorithm a if there exists a
polynomial p() such that, given any G in G, there
exists a correct sample Y for G such that whenever
Y ⊆ f_n, L(a(f_n)) = L(G) and ‖Y‖ ≤ p(‖G‖).
773 Probabilistic setting
- Using the distribution to measure error
- Identifying the distribution
- Approximating the distribution
78Probabilistic settings
- PAC learning
- Identification with probability 1
- PAC learning distributions
79Learning a language from sampling
- We have a distribution over Σ*
- We sample twice
- Once to learn
- Once to see how well we have learned
- The PAC setting
80How do we consider a finite set?
[Diagram: a distribution D over Σ*]
By sampling (1/ε)·ln(1/δ) examples we can find a
safe m.
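A quick numerical check of the (1/ε)·ln(1/δ) bound on this slide, with arbitrarily chosen values of ε and δ:

```python
import math

epsilon, delta = 0.1, 0.05
n = math.ceil((1 / epsilon) * math.log(1 / delta))
print(n)   # 30 examples suffice for these values
```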
81PAC learning(Valiant 84, Pitt 89)
- L a set of languages
- G a set of grammars
- ε > 0 and δ > 0
- m a maximal length over the strings
- n a maximal size of grammars
82- H is ε-AC (approximately correct) if
- Pr_D(H(x) ≠ G(x)) < ε
83[Diagram: L(G) and L(H); the symmetric difference is the set of errors]
- Errors: we want L1(D(G), D(H)) < ε
85[Diagram: a probabilistic automaton over {a, b}; what is Pr(abab)?]
86[Diagram: the same automaton with transition and stopping probabilities (0.1, 0.9, 0.35, 0.65, 0.7, 0.3), from which Pr(abab) can be computed]
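A sketch of how Pr(abab) is computed in a deterministic probabilistic automaton: multiply the transition probabilities along the path of abab, then the stopping probability of the state reached. The two-state PFA below is invented, not the one of the figure.

```python
def string_probability(pfa, w):
    """Probability of w in a deterministic PFA: product of the transition
    probabilities along the path, times the stopping probability at the end."""
    state, prob = pfa["start"], 1.0
    for c in w:
        nxt, p = pfa["delta"][(state, c)]
        prob *= p
        state = nxt
    return prob * pfa["final"][state]

# Invented 2-state PFA over {a, b}; in each state the outgoing probabilities
# plus the stopping probability sum to 1.
pfa = {
    "start": 0,
    "delta": {(0, "a"): (1, 0.5), (0, "b"): (0, 0.2),
              (1, "a"): (1, 0.1), (1, "b"): (0, 0.6)},
    "final": {0: 0.3, 1: 0.3},
}
print(string_probability(pfa, "abab"))   # 0.5 * 0.6 * 0.5 * 0.6 * 0.3 = 0.027
```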