Title: A Corpus-based Technique for Grammar Development
1A Corpus-based Technique for Grammar Development
- Philippe Blache, Marie-Laure Guénot, Tristan van Rullen
Laboratoire Parole et Langage, CNRS, Université de Provence, France
Corpus Linguistics SPROLAC Workshop, Lancaster, March 26th, 2003
2Outline
- Overview of the step-by-step grammar development process
- The formalism of Property Grammars
- Resources, tools and their use
- Some details on parsing tools
- Deep parser
- Shallow parser
- Multiplexer
- Conclusions and perspectives
3Step-by-step grammar development
- A fully constraint-based approach
- Broad-coverage grammars
- Several parsing tools
- For development
- For evaluation
4Overview of the development process
Completion stage (versioning on parts of a grammar):
- Input: tagged corpus
- Versioning: Property Grammar Version 1 ... Property Grammar Version n
- Non-deterministic deep parser: parse results R1 ... Rn
- Analysis: interpretation of syntactic phenomena
Releasing stage (large tests on whole grammar releases):
- Input: different tagged large corpora
- Property Grammar Version n, Property Grammar Version n+1
- Shallow parser: parse results Rn, Rn+1
- Multiplexer: multiplexed output and statistics about results
- Interpretation of the modifications, guided by the statistics
5The formalism of Property Grammars
- Totally constraint-based
- Properties (constraints): relations between categories of the same level
9The formalism of Property Grammars
Example: the (Det) most requested touristic (A) flights (N)
13The formalism of Property Grammars
- No explicit mention of constituency
- The set of properties describing a category forms a graph
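The idea that the properties describing a category form a graph can be sketched as follows. This is only an illustration: the property kinds used here (linearity, requirement, dependency) and the NP constraints themselves are assumptions made for the example, not taken from the slides, and the `Property` type and `property_graph` function are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple


class Property(NamedTuple):
    kind: str    # hypothetical property kind, e.g. "linearity"
    source: str  # category the property starts from
    target: str  # category the property points to


# Invented description of an NP: each property relates two categories
# of the same level (Det, A, N), with no mention of constituency.
NP_PROPERTIES = [
    Property("linearity", "Det", "A"),
    Property("linearity", "Det", "N"),
    Property("linearity", "A", "N"),
    Property("requirement", "Det", "N"),
    Property("dependency", "A", "N"),
]


def property_graph(properties):
    """Group properties by source category: nodes are categories, edges are properties."""
    graph = defaultdict(list)
    for prop in properties:
        graph[prop.source].append((prop.kind, prop.target))
    return dict(graph)


graph = property_graph(NP_PROPERTIES)
for category, edges in graph.items():
    print(category, "->", edges)
```

Storing the description as labelled edges keeps every property independent of the others, which is what allows them to be evaluated separately.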
14Parsing with Property Grammars
- Parsing = constraint satisfaction
- Characterization (parsing result) = state of the constraint system, i.e. the sets of satisfied and violated constraints
- Identification of a set of categories
- Identification of its relevant properties (evaluated)
- Building a characterization graph
- Whatever the input
- Unrestricted texts, spoken language corpora
- All constraints are at the same level, and are independent
- Separate evaluation is possible
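A characterization in the sense above can be sketched as the pair of satisfied and violated constraints for a sequence of categories. The sketch below is a minimal illustration under stated assumptions: the three property kinds, their exact semantics, and the NP constraints are invented for the example, and `characterize` is a hypothetical name rather than the authors' parser.

```python
def linearity(a, b):
    """Every a must precede every b; relevant only when both occur."""
    def check(cats):
        if a not in cats or b not in cats:
            return None  # property not relevant for this input
        last_a = max(i for i, c in enumerate(cats) if c == a)
        first_b = min(i for i, c in enumerate(cats) if c == b)
        return last_a < first_b
    return (f"{a} < {b}", check)


def uniqueness(a):
    """a may occur at most once; relevant only when a occurs."""
    def check(cats):
        return cats.count(a) == 1 if a in cats else None
    return (f"unique({a})", check)


def requirement(a, b):
    """If a occurs, b must occur too; relevant only when a occurs."""
    def check(cats):
        return b in cats if a in cats else None
    return (f"{a} => {b}", check)


# Invented NP constraints, all at the same level and independent.
NP_CONSTRAINTS = [
    linearity("Det", "A"),
    linearity("Det", "N"),
    linearity("A", "N"),
    uniqueness("Det"),
    requirement("Det", "N"),
]


def characterize(cats, constraints):
    """Evaluate each relevant constraint separately and record the outcome."""
    satisfied, violated = [], []
    for name, check in constraints:
        result = check(cats)
        if result is None:
            continue
        (satisfied if result else violated).append(name)
    return {"satisfied": satisfied, "violated": violated}


print(characterize(["Det", "A", "N"], NP_CONSTRAINTS))
print(characterize(["A", "Det", "N"], NP_CONSTRAINTS))
```

Because a violated constraint is simply recorded rather than causing failure, any input receives a characterization, including ill-formed or partial sequences.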
16Parsing with Property Grammars
- Syntactic description is based only on constraint satisfaction
- no derivation relation
- no need for a grammar to be complete, coherent or consistent to be evaluated
- possible representation of partial information and partial structures
- → Flexibility
17Resources: different corpora
- A French treebank
- 6500 tagged and disambiguated sentences out of a corpus of 13000 journalistic sentences
- Large corpora (160,000,000 words)
- French newspapers
- Novels
- Oral transcriptions
18Parsing Tools
- Non-deterministic deep parser
- syntactic phenomena identification
- grammar completion experimentation
- Deterministic shallow parser
- systematic evaluation on unrestricted data
- robustness and efficiency
- Multiplexer
- statistics about results
21Development Tool: Deep Parser
- Non-deterministic
- Descriptive point of view
- identification, within the corpus, of various occurrences of a construction (e.g. coordination)
- accurate empirical linguistic description
- tests with the Deep Parser to observe the evolution of the results (quality and quantity)
- integration of the results into the grammar
22Development Tool: Deep Parser
- Grammar versioning
- correction of the grammar
- tests with the Deep Parser to observe the evolution of the results (quality and quantity)
23Development Tool: Deep Parser
- Set of properties
- isolation, within the grammar, of a set of properties
- observation of its own behaviour and its impact on the Deep Parser
- modification of this set of properties and/or its semantics
- tests with the Deep Parser to observe the evolution of the results (quality and quantity)
24Deep parsing outputs
Two deep parses, from two grammar versions, of the sentence "so we'll ask you too hum why it's the best"
25Development Tool: Shallow Parser
- Deterministic
- heuristics to control the parse
- Classic left-corner parsing
- Dynamic constraint-satisfaction algorithm
- Test of the efficiency of the grammar over large corpora
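As a rough illustration of deterministic, left-corner-triggered chunking, the toy below opens a new phrase whenever a tag is the left corner of a different phrase type and otherwise extends the current chunk. It is far simpler than the parser described on this slide: it performs no constraint satisfaction and builds no nesting. The `LEFT_CORNER` table, the tag set, and the function names are all assumptions made for the example.

```python
# Hypothetical left-corner table: the phrase label each tag can open.
LEFT_CORNER = {"Det": "NP", "N": "NP", "Prep": "PP", "Adj": "AP", "V": "VP"}


def shallow_chunks(tagged):
    """Deterministic single pass over (word, tag) pairs, no backtracking."""
    chunks = []
    for word, tag in tagged:
        label = LEFT_CORNER.get(tag)
        if chunks and (label is None or label == chunks[-1][0]):
            # Heuristic: a tag that opens nothing new extends the current chunk.
            chunks[-1][1].append(word)
        else:
            chunks.append((label or "?", [word]))
    return chunks


def render(chunks):
    """Format chunks in a flat bracket style, one label per chunk."""
    return " ".join(f"({label}){' '.join(words)}" for label, words in chunks)


tagged = [
    ("the", "Det"), ("celebration", "N"), ("of", "Prep"), ("the", "Det"),
    ("tenth", "Adj"), ("anniversary", "N"), ("will", "V"), ("begin", "V"),
]
print(render(shallow_chunks(tagged)))
# (NP)the celebration (PP)of (NP)the (AP)tenth (NP)anniversary (VP)will begin
```

A single left-to-right pass with one decision per token is what makes this kind of strategy cheap enough to run over large corpora.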
26Shallow parsing outputs
Parse 1: (P) (NP)La celebration (PP)de (NP)le (AP)dixième (NP)anniversaire (PP)de (NP)la mort (PP)de (NP)Max Pol Fouchet (VP)va commencer
Parse 2: (P) (NP)La celebration (PP)de (NP)le (AP)dixième (NP)anniversaire (PP)de (NP)la mort (PP)de (NP)Max Pol Fouchet (VP)va commencer
Two shallow parses, from two different grammar releases, of the sentence "the celebration of the tenth anniversary of Max Pol Fouchet's death will begin"
27Evaluation Tool: Multiplexer
- Parameterised automatic evaluation strategy
- Comparison of common phrase boundaries
- Statistics: width, nature, count
- No need for a treebank to compare parses
- With a treebank, the multiplexer becomes an evaluation device
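The boundary comparison can be sketched as follows, assuming each parse is reduced to a set of `(label, start, end)` spans over token positions. The span representation, the `multiplex` function, the particular statistics, and the example spans are all hypothetical; the slides do not specify how the multiplexer is implemented.

```python
def multiplex(parse_a, parse_b):
    """Compare two parses of the same input by their labelled phrase boundaries."""
    spans_a, spans_b = set(parse_a), set(parse_b)
    labels = sorted({label for label, _, _ in spans_a | spans_b})
    stats = {}
    for label in labels:
        a = {s for s in spans_a if s[0] == label}
        b = {s for s in spans_b if s[0] == label}
        common = a & b  # phrases with identical nature and boundaries
        widths = [end - start for _, start, end in a | b]
        stats[label] = {
            "count_a": len(a),
            "count_b": len(b),
            "common": len(common),
            "different_pct": 100.0 * (1 - len(common) / max(len(a), len(b))),
            "mean_width": sum(widths) / len(widths),
        }
    return stats


# Invented spans for two parses of the same seven-token input.
parse_a = [("NP", 0, 2), ("PP", 2, 5), ("VP", 5, 7)]
parse_b = [("NP", 0, 2), ("PP", 2, 4), ("VP", 5, 7)]

for label, row in multiplex(parse_a, parse_b).items():
    print(label, row)
```

Passing two parser outputs compares grammar releases with each other; passing a treebank annotation as one of the two arguments turns the same comparison into an evaluation against a reference.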
28Some multiplexer statistics
This evaluation compares two grammars that leave NPs unchanged but yield 25% different VPs and 15% different PPs
29Conclusions and perspectives
- Equivalent and contradictory constraints are specified
- The Property Grammar paradigm is simplified and enriched by such information
- Taggers and parsers can still be improved and evaluated
- A French evaluation project is being prepared
- The results of the current development process lead to the programming of a context-dependent granular parser