1
Grammatical inference Vs Grammar induction
London 21-22 June 2007
Colin de la Higuera
2
Summary
  • Why study the algorithms and not the grammars
  • Learning in the exact setting
  • Learning in a probabilistic setting

3
1 Why study the process and not the result?
  • The usual approach in grammatical inference is to
    build a grammar (automaton), small and in some
    way adapted to the data we are supposed to learn
    from.

4
Grammatical inference
  • Is about learning a grammar given information
    about a language.

5
Grammar induction
  • Is about learning a grammar given information
    about a language.

6
Difference?
(Figure: Data → G)
7
Motivating example 1
  • Is 17 a random number?
  • Is 17 more random than 25?
  • Suppose I had a random number generator: would I
    convince you by showing how well it does on an
    example? On various examples?

(and only slightly provocative)
8
Motivating example 2
  • Is 01101101101101010110001111 a random sequence?
  • What about aaabaaabababaabbba?

9
Motivating example 3
  • Let X be a sample of strings. Is grammar G the
    correct grammar for sample X?
  • Or is it G′?
  • Correct meaning something like the one we should
    learn.

10
Back to the definition
  • Grammar induction and grammatical inference are
    about finding a/the grammar from some information
    about the language.
  • But once we have done that, what can we say?

11
What would we like to say?
  • That the grammar is the smallest, or the best
    (with respect to some score) → a combinatorial
    characterisation.
  • What we really want to say is that having solved
    some complex combinatorial question we have an
    Occam / compression-MDL-Kolmogorov style argument
    proving that what we have found is of interest.

12
What else might we like to say?
  • That in the near future, given some string, we
    can predict if this string belongs to the
    language or not.
  • It would be nice to be able to bet 100 on this.

13
What else would we like to say?
  • That if the solution we have returned is not
    good, then that is because the initial data was
    bad (insufficient, biased).
  • Idea: blame the data, not the algorithm.

14
Suppose we cannot say anything of the sort?
  • Then that means that we may be terribly wrong
    even in a favourable setting.

15
Motivating example 4
  • Suppose we have an algorithm that learns a
    grammar by iteratively applying the following two
    operations:
  • Merge two non-terminals whenever some nice
    MDL-like rule holds
  • Add a new non-terminal and rule corresponding to
    a substring when needed

16
Two learning operators
  • Creation of non-terminals and rules

NP → ART ADJ NOUN
NP → ART ADJ ADJ NOUN

becomes

NP → ART AP1
NP → ART ADJ AP1
AP1 → ADJ NOUN
17
  • Merging two non-terminals

NP → ART AP1
NP → ART AP2
AP1 → ADJ NOUN
AP2 → ADJ AP1

becomes

NP → ART AP1
AP1 → ADJ NOUN
AP1 → ADJ AP1
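A minimal sketch of these two operators (Python; the rule encoding, (lhs, rhs) pairs with rhs a tuple, is my assumption for illustration, not the slides' notation):

# Hypothetical encoding: a grammar is a set of (lhs, rhs) pairs.

def create_nonterminal(rules, substring, new_nt):
    """Add new_nt -> substring and fold each occurrence of
    substring into new_nt on every right-hand side."""
    substring = tuple(substring)
    k = len(substring)
    out = {(new_nt, substring)}
    for lhs, rhs in rules:
        rhs, i, folded = tuple(rhs), 0, []
        while i < len(rhs):
            if rhs[i:i + k] == substring:
                folded.append(new_nt)
                i += k
            else:
                folded.append(rhs[i])
                i += 1
        out.add((lhs, tuple(folded)))
    return out

def merge_nonterminals(rules, keep, drop):
    """Merge non-terminal `drop` into `keep` everywhere."""
    r = lambda s: keep if s == drop else s
    return {(r(lhs), tuple(r(s) for s in rhs)) for lhs, rhs in rules}

# The creation step of the slide:
g = {("NP", ("ART", "ADJ", "NOUN")), ("NP", ("ART", "ADJ", "ADJ", "NOUN"))}
g = create_nonterminal(g, ("ADJ", "NOUN"), "AP1")
# g is now: NP -> ART AP1, NP -> ART ADJ AP1, AP1 -> ADJ NOUN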
18
What is bound to happen?
  • We will learn a context-free grammar that can
    only generate a regular language.
  • Brackets are not found.
  • This is a hidden bias.

19
But how do we say that a learning algorithm is
good?
  • By accepting the existence of a target.
  • The question is that of studying the process of
    finding this target (or something close to this
    target). This is an inference process.

20
If you don't believe there is a target?
  • Or that the target belongs to another class?
  • You will have to come up with another bias. For
    example, believing that simplicity (e.g. MDL) is
    the correct way to handle the question.

21
If you are prepared to accept there is a target
but…
  • Either the target is known, and then what is the
    point of learning?
  • Or we don't know it in the practical case (with
    this data set), and then it is of no use.

22
Then you are doing grammar induction.
23
Careful
  • Some statements that are dangerous:
  • Algorithm A can learn {aⁿbⁿcⁿ : n ∈ ℕ}
  • Algorithm B can learn this rule with just 2
    examples
  • Looks to me close to wanting a free lunch

24
A compromise
  • You only need to believe there is a target while
    evaluating the algorithm.
  • Then, in practice, there may not be one!

25
End of provocative example
  • If I run my random number generator and get
    999999, I can only keep this number if I believe
    in the generator itself.

26
Credo (1)
  • Grammatical inference is about measuring the
    convergence of a grammar learning algorithm in a
    typical situation.

27
Credo(2)
  • Typical can be:
  • In the limit: learning is always achieved, one
    day
  • Probabilistic:
  • There is a distribution to be used (errors are
    measurably small)
  • There is a distribution to be found

28
Credo(3)
  • Complexity theory should be used: the total or
    update runtime, the size of the data needed, the
    number of mind changes, the number and weight of
    errors…
  • should be measured and limited.

29
2 Non-probabilistic setting
  • Identification in the limit
  • Resource bounded identification in the limit
  • Active learning (query learning)

30
Identification in the limit
  • The definitions, presentations
  • The alternatives
  • Order free or not
  • Randomised algorithm

31
A presentation is
  • a function f : ℕ → X
  • where X is any set,
  • yields : Presentations → Languages
  • If f(ℕ) = g(ℕ) then yields(f) = yields(g)

32
Some presentations (1)
  • A text presentation of a language L ⊆ Σ* is a
    function f : ℕ → Σ* such that f(ℕ) = L.
  • f is an infinite succession of all the elements
    of L.
  • (note: small technical difficulty with ∅)

33
Some presentations (2)
  • An informed presentation (or an informant) of
    L ⊆ Σ* is a function f : ℕ → Σ* × {+, −} such
    that f(ℕ) = (L × {+}) ∪ ((Σ* \ L) × {−}).
  • f is an infinite succession of all the elements
    of Σ*, labelled to indicate whether or not they
    belong to L.
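A small sketch of both kinds of presentation (Python; the length-lexicographic enumeration order is my choice for illustration, any order satisfying the definition would do):

from itertools import count, product

SIGMA = "ab"

def all_strings():
    """Enumerate Sigma* in length-lexicographic order."""
    for n in count(0):
        for t in product(SIGMA, repeat=n):
            yield "".join(t)

def text_presentation(in_language):
    """Text: a succession of all the elements of L.
    (For L = the empty set this yields nothing and loops
    forever: the 'small technical difficulty' noted above.)"""
    for w in all_strings():
        if in_language(w):
            yield w

def informant(in_language):
    """Informant: every string of Sigma*, labelled + or -."""
    for w in all_strings():
        yield (w, "+" if in_language(w) else "-")

# Example: L = strings with an even number of a's.
even_a = lambda w: w.count("a") % 2 == 0
print(next(informant(even_a)))   # ('', '+'): 0 a's is even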

34
Learning function
  • Given a presentation f, fn is the set of the
    first n elements in f.
  • A learning algorithm a is a function that takes
    as input a set fn = {f(0), …, f(n−1)} and returns
    a grammar.
  • Given a grammar G, L(G) is the language
    generated/recognised/represented by G.

35
Identification in the limit
(Figure: the setting. Pres ⊆ {ℕ → X} is a set of
presentations; yields maps Pres onto L, a class of
languages, with f(ℕ) = g(ℕ) ⇒ yields(f) = yields(g);
the naming function L maps G, a class of grammars,
onto L; the learner a maps finite samples to
grammars in G.)
Identification: ∃n ∈ ℕ : ∀k > n, L(a(fk)) = yields(f)
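To make the definition concrete, here is a toy illustration (mine, not the slides'): the learner that conjectures exactly the set of strings seen so far identifies every finite language in the limit from text.

def learner(sample):
    # a(fn): conjecture that the language is exactly the sample
    return frozenset(sample)

def convergence(presentation, target, horizon=1000):
    """Return the point after which the conjecture equals the
    target and never changes (within the horizon), else None."""
    seen, converged_at = set(), None
    for n, w in enumerate(presentation):
        if n >= horizon:
            break
        seen.add(w)
        if learner(seen) == target:
            if converged_at is None:
                converged_at = n
        else:
            converged_at = None          # mind change: reset
    return converged_at

target = frozenset({"ab", "ba", "aabb"})
f = iter(["ab", "ab", "ba", "aabb"] + ["ba"] * 996)  # a text for target
print(convergence(f, target))            # 3: identified from then on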
36
What about efficiency?
  • We can try to bound
  • global time
  • update time
  • errors before converging
  • mind changes
  • queries
  • good examples needed

37
What should we try to measure?
  • The size of G?
  • The size of L?
  • The size of f?
  • The size of fn?

38
Some candidates for polynomial learning
  • Total runtime polynomial in ‖L‖
  • Update runtime polynomial in ‖L‖
  • Mind changes polynomial in ‖L‖
  • Implicit prediction errors polynomial in ‖L‖
  • Size of characteristic sample polynomial in ‖L‖

39
(Figure: the elements f(0), f(1), …, f(n−1), …, f(k)
of the presentation give the samples f1, f2, …, fn,
…, fk; the learner a maps each sample to a
hypothesis G1, G2, …, Gn; after the convergence
point the hypothesis no longer changes: Gn, Gn, ….)
40
Some selected results (1)
41
Some selected results (2)
42
Some selected results (3)
43
3 Probabilistic setting
  • Using the distribution to measure error
  • Identifying the distribution
  • Approximating the distribution

44
Probabilistic settings
  • PAC learning
  • Identification with probability 1
  • PAC learning distributions

45
Learning a language from sampling
  • We have a distribution over Σ*
  • We sample twice
  • Once to learn
  • Once to see how well we have learned
  • The PAC setting
  • Probably approximately correct

46
PAC learning (Valiant 84, Pitt 89)
  • L a set of languages
  • G a set of grammars
  • ε > 0 and δ > 0
  • m a maximal length over the strings
  • n a maximal size of grammars

47
Polynomially PAC learnable
  • There is an algorithm that samples reasonably and
    returns, with probability at least 1 − δ, a
    grammar whose error is at most ε.

48
Results
  • Under cryptographic assumptions, we cannot PAC
    learn DFA.
  • Nor can we PAC learn NFA or CFGs, even with
    membership queries.

49
Learning distributions
  • No error
  • Small error

50
No error
  • This calls for identification in the limit with
    probability 1.
  • Means that the probability of not converging is 0.

51
Results
  • If probabilities are computable, we can learn
    finite state automata with probability 1.
  • But not with bounded (polynomial) resources.

52
With error
  • PAC definition
  • But error should be measured by a distance
    between the target distribution and the
    hypothesis
  • L1,L2,L? ?

53
Results
  • Too easy with L∞
  • Too hard with L1
  • Nice algorithms for biased classes of
    distributions.

54
For those that are not convinced there is a
difference
55
Structural completeness
  • Given a sample and a DFA
  • each edge is used at least once; each final state
    accepts at least one string
  • Look only at DFA for which the sample is
    structurally complete!

56
  • not structurally complete:
  • X = {aab, b, aaaba, bbaba}; add ε and abb

(Figure: a DFA over {a, b} for which X becomes
structurally complete once ε and abb are added.)
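A sketch of the structural-completeness check (Python; the DFA encoding, delta as a transition dict plus q0 and finals, is an assumption of this illustration):

def structurally_complete(delta, q0, finals, sample):
    """True iff the sample uses every transition of the DFA at
    least once and ends in every final state at least once."""
    used_edges, used_finals = set(), set()
    for w in sample:
        q = q0
        for c in w:
            used_edges.add((q, c))
            q = delta[(q, c)]            # follow the transition
        if q not in finals:
            return False                 # w is not even accepted
        used_finals.add(q)
    return used_edges == set(delta) and used_finals == set(finals)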
57
Question
  • Why is the automaton structurally complete for
    the sample?
  • And not the sample structurally complete for the
    automaton?

58
Some of the many things I have not talked about
  • Grammatical inference is about new algorithms
  • Grammatical inference is applied to various
    fields: pattern recognition, machine translation,
    computational biology, NLP, software engineering,
    web mining, robotics…

59
And
  • Next ICGI in Brittany in 2008
  • Some references in the one-page abstract, others
    on the grammatical inference webpage.

60
Appendix, some technicalities
Size of L
Size of f
Size of G
MC (mind changes)
Runtimes
IPE (implicit prediction errors)
PAC
CS (characteristic samples)
61
The size of G: ‖G‖
  • The size of a grammar is the number of bits
    needed to encode the grammar.
  • Better: some value polynomial in the desired
    quantity.
  • Example:
  • DFA: number of states
  • CFG: number of rules × length of the rules

62
The size of L
  • If no grammar system is given, meaningless.
  • If G is the class of grammars, then
    ‖L‖ = min{‖G‖ : G ∈ G ∧ L(G) = L}
  • Example: the size of a regular language, when
    considering DFA, is the number of states of the
    minimal DFA that recognises it.

63
Is a grammar representation reasonable?
  • Difficult question: typical arguments are that
    NFA are better than DFA because you can encode
    more languages with fewer bits.
  • Yet redundancy is necessary!

64
Proposal
  • A grammar class is reasonable if it encodes
    sufficiently many different languages.
  • I.e. with n bits you have 2^(n+1) encodings, so
    optimally you should have 2^(n+1) different
    languages.
  • Allow for redundancy and syntactic sugar, so
    p(2^(n+1)) different languages.

65
But
  • We should allow for redundancy and for some
    strings that do not encode grammars.
  • Therefore a grammar representation is reasonable
    if there exists a polynomial p() such that, for
    any n, the number of different languages encoded
    by grammars of size n is at least p(2^n).

66
The size of a presentation f
  • Meaningless. Or at least no convincing definition
    comes up.
  • But when associated with a learner a we can
    define the convergence point Cp(f, a): the point
    from which the learner a has found a grammar for
    the correct language L and does not change its
    mind.
  • Cp(f, a) = n such that ∀m ≥ n, a(fm) = a(fn) and
    L(a(fn)) = L
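A sketch of Cp on a finite prefix of a presentation (Python; learner and is_correct, a test that a hypothesis names the target language, are the assumed inputs of this illustration):

def convergence_point(f_prefix, learner, is_correct):
    """Point of the last mind change, provided the hypothesis
    held from there on is correct; None otherwise."""
    current, cp = object(), None         # sentinel: no hypothesis yet
    for n in range(1, len(f_prefix) + 1):
        h = learner(f_prefix[:n])        # a(fn)
        if h != current:                 # a mind change
            current = h
            cp = n if is_correct(h) else None
    return cp

f = ["ab", "ab", "ba", "ba", "ab"]
print(convergence_point(f, frozenset,
                        lambda h: h == {"ab", "ba"}))   # 3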

67
The size of a finite presentation fn
  • An easy attempt is n.
  • But this does not represent the quantity of
    information we have received to learn from.
  • A better measure is Σi<n |f(i)|
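A trivial instance of the two measures (Python):

f_n = ["aab", "b", "aaaba"]
n = len(f_n)                             # easy attempt: 3
info = sum(len(w) for w in f_n)          # sum of |f(i)|: 9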

68
Quantities associated with learner a
  • The update runtime: time needed to update
    hypothesis hn−1 into hn when presented with f(n).
  • The complete runtime: time needed to build
    hypothesis hn from fn. Also the sum of all
    update runtimes.

69
Definition 1 (total time)
  • G is polynomially identifiable in the limit from
    Pres if there exists an identification algorithm
    a and a polynomial p() such that given any G in
    G, and given any presentation f such that
    yields(f) = L(G), Cp(f, a) ≤ p(‖G‖).
  • (or global-runtime(a) ≤ p(‖G‖))

70
Impossible
  • Just take some presentation that stays useless
    until the bound is reached and then starts
    helping.

71
Definition 2 (update polynomial time)
  • G is polynomially identifiable in the limit from
    Pres if there exists an identification algorithm
    a and a polynomial p() such that given any G in
    G, and given any presentation f such that
    yields(f) = L(G), update-runtime(a) ≤ p(‖G‖).

72
Doesn't work
  • We can just defer identification.
  • Here we are measuring the time it takes to build
    the next hypothesis.

73
Definition 4 (polynomial number of mind changes)
  • G is polynomially identifiable in the limit from
    Pres if there exists an identification algorithm
    a and a polynomial p() such that given any G in
    G, and given any presentation f such that
    yields(f) = L(G),
  • #{i : a(fi) ≠ a(fi+1)} ≤ p(‖G‖).

74
Definition 5 (polynomial number of implicit
prediction errors)
  • Write G ⊬ x when G is incorrect with respect to
    an element x of the presentation (i.e. the
    algorithm producing G has made an implicit
    prediction error).

75
  • G is polynomially identifiable in the limit from
    Pres if there exists an identification algorithm
    a and a polynomial p() such that given any G in
    G, and given any presentation f such that
    yields(f) = L(G), #{i : a(fi) ⊬ f(i+1)} ≤ p(‖G‖).

76
Definition 6 (polynomial characteristic sample)
  • G has polynomial characteristic samples for
    identification algorithm a if there exists a
    polynomial p() such that given any G in G, there
    is a correct sample Y for G such that whenever
    Y ⊆ fn, a(fn) is equivalent to G and ‖Y‖ ≤ p(‖G‖).

77
3 Probabilistic setting
  • Using the distribution to measure error
  • Identifying the distribution
  • Approximating the distribution

78
Probabilistic settings
  • PAC learning
  • Identification with probability 1
  • PAC learning distributions

79
Learning a language from sampling
  • We have a distribution over Σ*
  • We sample twice
  • Once to learn
  • Once to see how well we have learned
  • The PAC setting

80
How do we consider a finite set?
(Figure: the distribution D over Σ*, with the tail
of strings longer than m cut off.)
By sampling 1/ε · ln 1/δ examples we can find a
safe m.
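A worked instance of that bound (Python; the ε and δ values are mine):

import math

def safe_sample_size(eps, delta):
    # m >= (1/eps) * ln(1/delta), rounded up
    return math.ceil(math.log(1 / delta) / eps)

print(safe_sample_size(0.05, 0.01))      # 93 draws suffice here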
81
PAC learning (Valiant 84, Pitt 89)
  • L a set of languages
  • G a set of grammars
  • ε > 0 and δ > 0
  • m a maximal length over the strings
  • n a maximal size of grammars

82
  • H is ε-AC (approximately correct)
  • if
  • PrD[H(x) ≠ G(x)] < ε
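Read operationally, this is a disagreement mass under D; a Monte-Carlo sketch (Python; H, G and draw, a sampler for D, are placeholders of this illustration):

def estimated_error(H, G, draw, n=100_000):
    """Estimate Pr_D[H(x) != G(x)] by sampling; H is epsilon-AC
    when the true disagreement mass is below epsilon."""
    errors = 0
    for _ in range(n):
        x = draw()                       # x ~ D
        errors += H(x) != G(x)
    return errors / n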

83
(Figure: L(G) and L(H) overlap; the errors are
their symmetric difference.)
We want L1(D(G), D(H)) < ε
84
(No Transcript)
85
(Figure: an automaton over {a, b}; the slide
computes Pr(abab).)
86
(Figure: a probabilistic automaton with transition
probabilities 0.1, 0.9, 0.35, 0.7, 0.65, 0.3.)
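The figure itself is not recoverable from the transcript, so here is a generic sketch of the computation it illustrates: in a deterministic probabilistic automaton, Pr(w) is the product of the transition probabilities along the unique path of w, times the stopping probability of the state reached. The automaton below is hypothetical (its probabilities merely echo the figure's numbers):

# trans[(state, symbol)] = (next_state, probability);
# stop[state] = stopping probability. Outgoing + stopping sum to 1.
trans = {(0, "a"): (1, 0.7), (0, "b"): (0, 0.1),
         (1, "a"): (1, 0.35), (1, "b"): (0, 0.65)}
stop = {0: 0.2, 1: 0.0}

def pfa_prob(w, q=0):
    """Pr(w): product of transition probabilities times stop prob."""
    p = 1.0
    for c in w:
        q, t = trans[(q, c)]
        p *= t
    return p * stop[q]

print(pfa_prob("abab"))   # 0.7 * 0.65 * 0.7 * 0.65 * 0.2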
87
(No Transcript)
88
(No Transcript)