Title: Parsing with PCFGs and Automatic F-Structure Annotation
1 Parsing with PCFGs and Automatic F-Structure Annotation
- Aoife Cahill, Mairéad McCarthy, Josef van Genabith, Andy Way
- Computer Applications
- National Centre for Language Technology
- Dublin City University
- Dublin, Ireland
- {acahill, mccarthy, josef, away}@computing.dcu.ie
2 Background
- Treebanks (Penn-II, SUSANNE, ...) are very useful resources
- But they contain just trees ...
- We want to know who did what to whom
- LFG f-structures / logical forms
3 Background
4 Overview
- How do we get Penn-II with f-structure information?
  - Manual
  - Automatic
    - regular expressions
    - set rewriting
    - algorithm
- What do we do with it?
  - Parsing
    - text → PCFG → AA → f-structure
    - text → PCFG+AA → f-structure
- Conclusion
5 Automatic Annotation
- Based on configurational and categorial information (trees or CFG rules)
- Three possible architectures
  - regular expression based (Sadler, van Genabith, Way, 2000)
  - rewriting of flat tree descriptions in tree representation logic (Frank, 2000; Liakata and Pulman, 2002)
  - algorithm (Kaplan, 1996: ATIS, LFG-DOP; Lappin, Golan, Rimon, 1989; Cahill et al., 2002)
6 Regular expression based annotation (Sadler, van Genabith, Way, 2000)
- Extract CFG rules from treebank (Charniak, 1996)
  - vp:VP > adv:A v0:V0 v0:V1 v0:V2 s:S pp:P
- Formulate f-structure annotation principles
  - vp:VP > v0:V1 v0:V2
    @ V1:xcomp = V2, V1:subj = V2:subj.
  - vp:VP > (v0) v0:V0
    @ VP = V0.
  - vp:VP > v0:V0 s:S
    @ V0:comp = S.
- Apply annotation principles
  - vp:VP > adv:A v0:V0 v0:V1 v0:V2 s:S pp:P
    @ VP = V0,
      V0:xcomp = V1, V0:subj = V1:subj,
      V1:xcomp = V2, V1:subj = V2:subj,
      V2:comp = S.
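The way such principles might be applied to a flat rule can be sketched in Python. This is only an illustration, not the implementation of Sadler, van Genabith and Way: the `category:Variable` string encoding, the function name `annotate_flat_vp` and the equation syntax are assumptions made for the example.

```python
import re

# Assumed encoding (not from the source): each daughter is written as
# "category:Variable"; the mother variable is passed in separately.
def annotate_flat_vp(mother, rhs):
    """Return f-structure equations for a flat VP rule."""
    equations = []

    # Principle: the leftmost v0 is the head of the VP.
    head = re.search(r"\bv0:(\w+)", rhs)
    if head:
        equations.append(f"{mother} = {head.group(1)}")

    # Principle: in a sequence v0 v0, the second verb is the xcomp of the
    # first and shares its subject. The lookahead lets matches overlap, so
    # a chain V0 V1 V2 yields the pairs (V0, V1) and (V1, V2).
    for first, second in re.findall(r"\bv0:(\w+) (?=v0:(\w+))", rhs):
        equations.append(f"{first}:xcomp = {second}")
        equations.append(f"{first}:subj = {second}:subj")

    # Principle: a v0 immediately followed by s takes the clause as comp.
    for verb, clause in re.findall(r"\bv0:(\w+) s:(\w+)", rhs):
        equations.append(f"{verb}:comp = {clause}")

    return equations


equations = annotate_flat_vp("VP", "adv:A v0:V0 v0:V1 v0:V2 s:S pp:P")
```

Each principle is a small pattern over the rule's right-hand side, so one principle covers every rule that contains the matching configuration.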
7 Regular expression based annotation (Sadler, van Genabith, Way, 2000)
- Factors out generalisations
- Principle-based LFG architecture (Bresnan, 2000)
- Recall/precision in the low to mid 90s
- But
  - So far applied to treebank (AP) fragments of the order of 500 CFG rules
  - 100 trees ...
  - Penn-II: 17K rule types, 50,000 trees
8 Rewriting of flat tree descriptions (Frank, 2000)
- Represent tree as flat set of terms in tree description language
        S_A
       /   \
    NP_B   VP_C
     |     /  \
     HP  sold  NP_D
                |
              Compaq
  dom(A,B), dom(A,C), dom(C,D),
  cat(A,S), cat(B,NP), cat(C,VP), cat(D,NP),
  prec(B,C), ...
- Formulate annotation principles as set rewriting rules (Kay, 2000)
  - dom(X,Y), dom(X,Z), prec(Y,Z), cat(X,S), cat(Y,NP), cat(Z,VP)
    ==> phi(X,FX), phi(Y,FY), phi(Z,FZ), subj(FX,FY), eq(FX,FZ)
- And apply ...
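A minimal Python sketch of the idea follows. It is a simplification rather than Frank's system: the nested-tuple tree encoding, the function names `describe` and `subject_rule`, and the choice to return only the newly added terms (rather than rewrite the term set in place) are assumptions for illustration.

```python
import string

def describe(tree):
    """Flatten a nested (category, child, ...) tree into a set of terms."""
    terms = set()
    labels = iter(string.ascii_uppercase)  # node names A, B, C, ...

    def walk(node):
        label = next(labels)
        cat, *children = node
        terms.add(("cat", label, cat))
        # Only nonterminal children (tuples) become nodes; words are skipped.
        child_labels = [walk(c) for c in children if isinstance(c, tuple)]
        for child in child_labels:
            terms.add(("dom", label, child))
        for left, right in zip(child_labels, child_labels[1:]):
            terms.add(("prec", left, right))
        return label

    walk(tree)
    return terms


def subject_rule(terms):
    """S dominating NP before VP: NP is the subject, VP shares S's f-structure."""
    new_terms = set()
    doms = [(a, b) for kind, a, b in terms if kind == "dom"]
    for x, y in doms:
        for x2, z in doms:
            if (x2 == x and ("prec", y, z) in terms
                    and ("cat", x, "S") in terms
                    and ("cat", y, "NP") in terms
                    and ("cat", z, "VP") in terms):
                fx, fy, fz = "F" + x, "F" + y, "F" + z
                new_terms |= {("phi", x, fx), ("phi", y, fy), ("phi", z, fz),
                              ("subj", fx, fy), ("eq", fx, fz)}
    return new_terms


tree = ("S", ("NP", "HP"), ("VP", "sold", ("NP", "Compaq")))
terms = describe(tree)
f_terms = subject_rule(terms)
```

Because the rule matches on `dom` and `prec` terms rather than on a single local rule, the same mechanism can in principle inspect arbitrary tree fragments.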
9 Rewriting of flat tree descriptions (Frank, 2000)
- Factors out generalisations
- Principle-based LFG architecture (Bresnan, 2000)
- Can look at arbitrary tree fragments
- Works well
- But
  - So far applied to treebank fragments (SUSANNE) of the order of 100 trees ...
  - Penn-II: 17K rule types, 50,000 trees
10 Tree-to-F-Structure Algorithms
- Tree-to-f-structure algorithm architectures
  - direct tree-to-f-structure transduction
    - (Kaplan, 1996): ATIS corpus
  - indirect tree node annotation
    - (Lappin, Golan, Rimon, 1989): subject, direct object, object of preposition, verb-particle and noun argument of adjective
    - (Cahill et al., 2002): f-structure annotation
11 Annotation Algorithm
- Recursive procedure (Java)
- Scale to Penn-II treebank
- Robust
- Proto-f-structures
  - Basic predicate argument structure
  - Possibly partial (f-structure fragments)
  - Less detail (some reentrancies)
12 Annotation Algorithm
- Clean design to facilitate maintenance and reuse
- Linguistic basis
- Algorithm
  - Recursive procedure on treebank trees
  - Three main components, applied in this order (ordering important)
    - left/right context annotation principles
    - co-ordinating configurations annotation principles
    - catch-all and clean-up
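The control flow described here can be sketched as a recursive driver that runs the three components in a fixed order at every node. This is a Python illustration under stated assumptions, not the authors' Java code: the `Node` class, the toy table `TOY_CONTEXT`, the equation strings and the rule that an earlier component's annotation is never overwritten are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    cat: str
    children: List["Node"] = field(default_factory=list)
    annotation: Optional[str] = None

# Toy stand-in for the left/right context principles: (mother, daughter) -> equation.
TOY_CONTEXT = {("S", "NP"): "up.subj = down",
               ("S", "VP"): "up = down",
               ("VP", "VBD"): "up = down"}

def set_once(node, equation):
    """Annotate a node only if no earlier component has already done so."""
    if node.annotation is None:
        node.annotation = equation

def left_right_context(node):
    if any(child.cat == "CC" for child in node.children):
        return  # co-ordination is left to the next component
    for child in node.children:
        if (node.cat, child.cat) in TOY_CONTEXT:
            set_once(child, TOY_CONTEXT[(node.cat, child.cat)])

def coordination(node):
    if len(node.children) == 3 and node.children[1].cat == "CC":
        left, cc, right = node.children
        set_once(cc, "up = down")
        set_once(left, "down in up.conj")
        set_once(right, "down in up.conj")

def catch_all(node):
    for child in node.children:
        if child.cat.endswith(("-TMP", "-LOC")):
            set_once(child, "down in up.adjunct")

# The order of this tuple is the order of application.
COMPONENTS = (left_right_context, coordination, catch_all)

def annotate(node):
    """Apply every component to the local tree, then recurse into the daughters."""
    for component in COMPONENTS:
        component(node)
    for child in node.children:
        annotate(child)
    return node


tree = annotate(Node("S", [
    Node("NP", [Node("NP"), Node("CC"), Node("NP")]),
    Node("VP", [Node("VBD"), Node("NP-TMP")]),
]))
```

Keeping the components in one ordered tuple makes the dependence on ordering explicit and keeps each component independently replaceable.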
13 Left/right context annotation principles
- Apply iff no CC present
- Positional and categorial information
- Based on simple tri-partition of rule RHS
  - LHS → Left Context, Head, Right Context
  - LHS → LC HD RC
  - NP → DT ADJP NN NN RCL PP
- Penn-II functional annotations
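The tri-partition step can be illustrated with a short Python sketch. The head-finding rule used here (pick the rightmost noun-like daughter of an NP) is a deliberately crude assumption standing in for real head rules, and the names `HEAD_CANDIDATES` and `tripartition` are hypothetical.

```python
# Hypothetical head preferences per mother category (a toy stand-in for
# proper head-finding rules).
HEAD_CANDIDATES = {"NP": {"NN", "NNS", "NNP"}}

def tripartition(lhs, rhs):
    """Split a rule's RHS into (left context, head, right context)."""
    candidates = HEAD_CANDIDATES[lhs]
    # Scan right to left and stop at the first daughter that can be the head.
    for i in range(len(rhs) - 1, -1, -1):
        if rhs[i] in candidates:
            return rhs[:i], rhs[i], rhs[i + 1:]
    raise ValueError(f"no head found in {lhs} -> {' '.join(rhs)}")


left, head, right = tripartition("NP", ["DT", "ADJP", "NN", "NN", "RCL", "PP"])
```

Once the head is fixed, every other daughter is classified purely by its category and by whether it falls in the left or the right context.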
14 Left/right context annotation principles
- Head: head rules as in (Magerman, 1995), lexicalised PCFGs
- LC - H - RC
- Complements
- Adjuncts
15 Left/right context annotation principles
- Example Matrix
  - NP → DET NN
  - NP → ADJP NNS RELCL
  - NP → NN NN PP
  - lots more
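An annotation matrix of this kind can be thought of as a lookup table keyed by context side and category. The Python sketch below is purely illustrative: the entries in `NP_MATRIX`, the attribute names and the equation strings are assumptions chosen for the example, not the matrix actually used.

```python
# Illustrative NP matrix (assumed entries): which annotation a daughter
# receives depends on its category and on which side of the head it occurs.
NP_MATRIX = {
    "left": {"DET": "up.spec = down",
             "ADJP": "down in up.adjunct",
             "NN": "down in up.adjunct"},
    "head": "up = down",
    "right": {"RELCL": "up.relmod = down",
              "PP": "down in up.adjunct"},
}

def apply_matrix(left, head, right, matrix=NP_MATRIX):
    """Annotate a tri-partitioned RHS; categories not in the matrix get None."""
    annotated = [(cat, matrix["left"].get(cat)) for cat in left]
    annotated.append((head, matrix["head"]))
    annotated += [(cat, matrix["right"].get(cat)) for cat in right]
    return annotated


det_nn = apply_matrix(["DET"], "NN", [])
adjp_nns_relcl = apply_matrix(["ADJP"], "NNS", ["RELCL"])
nn_nn_pp = apply_matrix(["NN"], "NN", ["PP"])
```

Daughters that the matrix does not cover stay unannotated (`None`) and are left for later components to handle.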
16 Left/right context annotation principles
17 Left/right context annotation principles
- Do annotation matrix for each of the monadic categories in Penn-II
- Based on analysing the most frequent rule types for each category such that
  - sum total of token frequencies of these rule types is greater than 85% of total number of rule tokens for that category
  Category   Rule types   Rule types analysed
  NP         6595         102
  VP         10239        307
  S          2602         20
  ADVP       234          6
- Apply annotation matrix to all rules/sub-trees, i.e. also those NP-LOC, NP-TMP etc.
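The selection criterion can be made concrete with a small Python sketch: take rule types in order of decreasing token frequency until they cover more than 85% of the rule tokens for the category. The function name `rules_for_coverage` and the toy counts are made up for illustration.

```python
from collections import Counter

def rules_for_coverage(rule_counts, threshold=0.85):
    """Most frequent rule types whose tokens sum to more than `threshold` of all tokens."""
    total = sum(rule_counts.values())
    selected, covered = [], 0
    for rule, count in rule_counts.most_common():
        if covered > threshold * total:
            break
        selected.append(rule)
        covered += count
    return selected


# Toy token counts for one category (invented numbers).
np_rules = Counter({
    "NP -> DT NN": 50,
    "NP -> NN": 30,
    "NP -> DT JJ NN": 10,
    "NP -> NN NN PP": 6,
    "NP -> DT ADJP NN NN RCL PP": 4,
})
selected = rules_for_coverage(np_rules)
```

Because rule frequencies are heavily skewed, a small number of rule types is enough to pass the threshold, which is what keeps the manual analysis per category manageable.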
18 Co-ordinating configurations annotation principles
- These vary a lot ...
- For all rules with length(RHS) = 3 and
  - RHS = X CC Y
  - CC: ↑=↓
  - X, Y: ↓∈↑CONJ
- etc.
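The simplest case above can be sketched in Python as follows. The reading assumed here is that the conjunction carries the head equation and both conjuncts become members of a CONJ set; the function name and the equation strings are hypothetical.

```python
def annotate_coordination(rhs):
    """Annotate a three-daughter rule of the shape X CC Y; return None otherwise."""
    if len(rhs) != 3 or rhs[1] != "CC":
        return None
    first, cc, second = rhs
    return [(first, "down in up.conj"),
            (cc, "up = down"),
            (second, "down in up.conj")]


np_coord = annotate_coordination(["NP", "CC", "NP"])
not_coord = annotate_coordination(["DT", "NN"])
```

Returning `None` for every other shape keeps this case separate from the many other co-ordination configurations, which need their own principles.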
19 Catch-all and Clean-up
- Use functional annotations in Penn-II
  - Everything with X-TMP, X-LOC, ... gets X: ↓∈↑ADJ
- Every PP unannotated ...
- Two NPs under VP
  - NP1: ↑OBJ=↓, NP2: ↑OBJ2=↓
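Two of these clean-up rules can be sketched in Python. This is an illustration under assumptions: only the two tags named on the slide are handled, the treatment of unannotated PPs is omitted because the slide does not spell it out, and the function name and equation strings are hypothetical.

```python
# Only the two functional tags named explicitly; the full list is not given.
ADJUNCT_TAGS = ("-TMP", "-LOC")

def catch_all(mother, daughters):
    """daughters is a list of (category, annotation-or-None) pairs."""
    np_positions = [i for i, (cat, _) in enumerate(daughters) if cat == "NP"]
    double_object = mother == "VP" and len(np_positions) == 2
    result = []
    for i, (cat, annotation) in enumerate(daughters):
        if cat.endswith(ADJUNCT_TAGS):
            # Functionally tagged constituents become adjuncts.
            annotation = "down in up.adjunct"
        elif double_object and cat == "NP":
            # Two bare NPs under VP: the first is OBJ, the second OBJ2.
            annotation = "up.obj = down" if i == np_positions[0] else "up.obj2 = down"
        result.append((cat, annotation))
    return result


cleaned = catch_all("VP", [("VBD", "up = down"), ("NP", None), ("NP", None), ("NP-TMP", None)])
```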
20 Algorithm Example Input
21 Algorithm Example Output
22 Algorithm Example f-structures
- Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29
    subj : adjunct : 1 : num : sing
                         pers : 3
                         pred : Pierre
                     2 : adjunct : 3 : adjunct : 4 : pred : 61
                                       pers : 3
                                       pred : years
                                       num : pl
                         pred : old
           num : sing
           pers : 3
           pred : Vinken
    xcomp : subj : adjunct : 1 : num : sing
                                 pers : 3
                                 pred : Pierre
                             2 : adjunct : 3 : adjunct : 4 : pred : 61
                                               pers : 3
                                               pred : years
23 Algorithm Example f-structures
- Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29
- (F-structure continued)
                   num : sing
                   pers : 3
                   pred : Vinken
            obj : spec : det : pred : the
                  num : sing
                  pers : 3
                  pred : board
            obl : obj : spec : det : pred : a
                        adjunct : 5 : pred : nonexecutive
                        pred : director
                        num : sing
                        pers : 3
                  pred : as
            pred : join
            adjunct : 6 : pred : Nov.
                          num : sing
24 Evaluation
- Qualitative (with gold standard)
  - randomly select 105 trees from section 23
  - hand-annotate with f-structure information (gold standard)
  - score automatically annotated trees against the gold standard with evalb
- Quantitative (without gold standard)
  - fragmentation
  - number of RHS constituents that receive annotation etc.
  - number of trees that do not receive f-structure
25 Evaluation Qualitative ("gold standard")
- 105 sentences from section 23 hand-annotated,
- 99 sentences len
- -- All --
- Bracketing Recall 86.16
- Bracketing Precision 86.16
- Tagging Accuracy 92.06
- -- len
- Bracketing Recall 86.19
- Bracketing Precision 86.19
- Tagging Accuracy 91.82
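For readers unfamiliar with bracket scoring, the Python sketch below shows how labelled bracketing precision, recall and F-score are computed when node labels carry the f-structure annotation. It is a simplified stand-in for evalb, not its implementation; the `(label, start, end)` bracket encoding and the example labels are assumptions.

```python
from collections import Counter

def bracket_scores(gold, test):
    """Labelled precision, recall and F-score over (label, start, end) brackets."""
    gold_bag, test_bag = Counter(gold), Counter(test)
    matched = sum((gold_bag & test_bag).values())  # multiset intersection
    precision = matched / sum(test_bag.values())
    recall = matched / sum(gold_bag.values())
    f_score = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f_score


# Same tree shape in both; only one annotation differs (invented example).
gold = [("S", 0, 3), ("NP[up.subj=down]", 0, 1),
        ("VP[up=down]", 1, 3), ("NP[up.obj=down]", 2, 3)]
test = [("S", 0, 3), ("NP[up.subj=down]", 0, 1),
        ("VP[up=down]", 1, 3), ("NP[down in up.adjunct]", 2, 3)]
precision, recall, f_score = bracket_scores(gold, test)
```

When the gold and test trees have the same shape and differ only in their annotations, both contain the same number of brackets, so precision and recall necessarily coincide, as they do in the figures above.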
26 Evaluation Quantitative
- Results
- RHS annotations (excluding punctuation)
  LHS        # RHS elements   # RHS annotated   % annotated
  ADJP       1653             1633              99.44
  ADJP-ADV   21               21                100.0
  ADJP-CLR   27               27                100.0
  ADVP       607              606               99.84
  NP         30793            29770             96.68
  PP         1090             1089              99.92
  S          14912            14849             99.58
  SBAR       423              422               99.88
  SBARQ      270              269               99.61
  SQ         657              548               83.43
  VP         40990            40800             99.78
27 Evaluation Quantitative
- Results
- trees with f-structures
  # f-structures   # trees   % of trees
  0                1007      2.048
  1                43292     88.051
  2                3555      7.23
  3                639       1.299
  4                445       0.905
  5                169       0.344
  6                43        0.087
  7                11        0.022
  8                5         0.010
  9                1         0.002
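The percentages follow directly from the tree counts. The short Python check below recomputes them; the total of 49,167 trees is not stated on the slide and is obtained here simply by summing the listed counts.

```python
# Number of trees by number of f-structure fragments, as listed above.
fragment_counts = {0: 1007, 1: 43292, 2: 3555, 3: 639, 4: 445,
                   5: 169, 6: 43, 7: 11, 8: 5, 9: 1}

total_trees = sum(fragment_counts.values())
percentages = {n: 100 * count / total_trees for n, count in fragment_counts.items()}

# Trees with more than one f-structure are the fragmented ones.
fragmented = sum(count for n, count in fragment_counts.items() if n > 1)
```

On these counts the single-f-structure case dominates, with the zero and multiple-fragment cases making up the remainder.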
28 Parsing
- Parsing and automatic annotation
- Two architectures
  - PCFG automatic annotation pipeline
    - text → PCFG → LFG → annotated trees
  - LFG-PCFG integrated model
    - text → PCFG+LFG → annotated trees
- 12 experiments
29 Parsing
- 3 baseline PCFGs
  - PCFG: 18K rules
  - PCFGF: 29K rules
  - PCFGP: 27K rules (Johnson, 1998)
- Derived from sections 02-21
- Standard preprocessing
  - delete empty productions
  - collapse unary structure
- CYK parser, Chomsky normal form
- Evaluation
  - 2.4K trees from section 23, precision/recall (PR), F-score
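The preprocessing and grammar extraction steps can be sketched in Python as below. This is one plausible reading rather than the setup actually used: trees are assumed to be nested tuples, empty elements are assumed to be `-NONE-` preterminals, unary chains are collapsed by keeping the upper label, and rule probabilities are plain relative frequencies.

```python
from collections import Counter

def is_preterminal(node):
    return all(isinstance(child, str) for child in node[1:])

def prune_empty(tree):
    """Drop -NONE- preterminals and any constituent left without daughters."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    if label == "-NONE-":
        return None
    kept = [c for c in map(prune_empty, children) if c is not None]
    return (label, *kept) if kept else None

def collapse_unary(tree):
    """Replace X -> Y (Y a phrasal node) by X over Y's daughters."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    while (len(children) == 1 and isinstance(children[0], tuple)
           and not is_preterminal(children[0])):
        children = list(children[0][1:])
    return (label, *map(collapse_unary, children))

def rules(tree):
    """Yield the phrasal rules (LHS, RHS categories) used in a tree."""
    if isinstance(tree, str) or is_preterminal(tree):
        return
    label, *children = tree
    yield label, tuple(child[0] for child in children)
    for child in children:
        yield from rules(child)

def estimate(trees):
    """Relative-frequency estimate: count(LHS -> RHS) / count(LHS)."""
    counts = Counter(rule for tree in trees for rule in rules(tree))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


raw = ("S",
       ("NP", ("NNP", "Vinken")),
       ("VP", ("VB", "joined"),
              ("NP", ("NP", ("DT", "the"), ("NN", "board"))),
              ("ADVP", ("-NONE-", "*"))))
clean = collapse_unary(prune_empty(raw))
probabilities = estimate([clean])
```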
30 Parsing
- 3 PCFGs + LFG Annotation (integrated)
  - PCFGLFG: 31K rules
  - PCFGFLFG: 40K rules
  - PCFGPLFG: 37K rules
- Evaluation
  - 2.4K trees section 23, PR, F-score
  - f-structures generated for section 23, fragmentation (cheating!)
  - PR for 105 manually annotated gold standard trees in section 23 (cheating!)
31 Parsing
- 3 PCFGs → LFG Annotation (pipeline)
  - PCFG → LFG: 18K rules
  - PCFGF → LFG: 29K rules
  - PCFGP → LFG: 27K rules
- Evaluation
  - 2.4K trees section 23, PR, F-score
  - f-structures generated for section 23, fragmentation (cheating!)
  - PR for 105 manually annotated gold standard trees in section 23 (cheating!)
32 Parsing
- 6 PCFGs + LFG (compacted, integrated and pipeline)
  - thresholding (Krotov, Gaizauskas, Hepple, Wilks, 1998), only rules used at least 5 times
  - PCFGLFGC: 4.8K rules     PCFGC → LFG: 3K rules
  - PCFGFLFGC: 5.5K rules    PCFGFC → LFG: 4.3K rules
  - PCFGPLFGC: 5.4K rules    PCFGPC → LFG: 4.4K rules
- Evaluation
  - 2.4K trees section 23, PR, F-score
  - f-structures generated for section 23, fragmentation (cheating!)
  - PR for 105 manually annotated gold standard trees in section 23 (cheating!)
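Compaction by thresholding can be sketched in a few lines of Python. The cut-off of 5 comes from the slide; renormalising the surviving rules per left-hand side is an assumption of this sketch, as are the function name and the toy counts.

```python
from collections import Counter

def compact(rule_counts, min_count=5):
    """Keep rules seen at least `min_count` times and renormalise per LHS."""
    kept = {rule: n for rule, n in rule_counts.items() if n >= min_count}
    lhs_totals = Counter()
    for (lhs, _), n in kept.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in kept.items()}


# Invented counts: two frequent NP rules and one rare one.
counts = {
    ("NP", ("DT", "NN")): 60,
    ("NP", ("NN",)): 20,
    ("NP", ("DT", "ADJP", "NN", "NN", "RCL", "PP")): 1,
    ("VP", ("VB", "NP")): 5,
}
compacted = compact(counts)
```

A smaller grammar parses faster, but any sentence that needs one of the dropped rules can no longer receive its treebank analysis, so there is a trade-off between speed and coverage.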
33 Parsing Full Grammars
34 Parsing Compacted Grammars
35 Number of Rules
36 Time Taken to Parse
37 Number of parses found
38 Labelled F-Score
39 Unlabelled F-Score
40 % receiving 1 F-Structure
41 F-Score of 105 gold standard
42 Conclusion
- Design and implementation of automatic f-structure annotation algorithm for Penn-II
- Evaluation of f-structures generated
  - precision and recall against gold standard
  - fragmentation/partiality/nodes without annotation
- Parsing and automatic annotation
  - PCFG automatic annotation pipeline
  - LFG-PCFG integrated model
43 Conclusion
- Further work
  - root forms as PRED values
  - refine annotation algorithm
  - ParseEval gold standard, PARC gold standard
  - use XEROX XLE constraint solver
  - localise long distance phenomena (functional uncertainty equations)
  - exploit Penn-II annotations for extracted material
  - compile f-structures into Quasi Logical Forms / UDRSs
  - more sophisticated probability models for parsing
  - and, and, and ...
- Demo