1
Parsing with PCFGs and Automatic F-Structure Annotation

Aoife Cahill, Mairéad McCarthy, Josef van Genabith, Andy Way
Computer Applications
National Centre for Language Technology
Dublin City University
Dublin, Ireland
{acahill, mccarthy, josef, away}@computing.dcu.ie

2
Background

Treebanks (Penn-II, SUSANNE, ...) are very useful resources
But just trees ...
Want to know who did what to whom?
LFG f-structures / logical forms

3
Background
4
Overview

How do we get Penn-II with f-structure information?
Manual
Automatic
regular expressions
set rewriting
algorithm
What do we do with it?
Parsing
text → PCFG → AA → f-structure
text → PCFG → AA → f-structure
Conclusion

5
Automatic Annotation

Based on configurational and categorial information (trees or CFG rules)
Three possible architectures
regular expression based [Sadler, van Genabith, Way, 2000]
rewriting of flat tree descriptions in tree representation logic [Frank, 2000; Liakata and Pulman, 2002]
algorithm [Kaplan, 1996] (ATIS, LFG-DOP), [Lappin, Golan, Rimon, 1989], [Cahill et al., 2002]

6
Regular expression based annotation [Sadler, van Genabith, Way, 2000]

Extract CFG rules from treebank [Charniak, 1996]
vp:VP > adv:A v0:V0 v0:V1 v0:V2 s:S pp:P
Formulate f-structure annotation principles
vp:VP > v0:V1 v0:V2
@ [V1:xcomp = V2, V1:subj = V2:subj].
vp:VP > (v0) v0:V0
@ [VP = V0].
vp:VP > v0:V0 s:S
@ [V0:comp = S].
Apply annotation principles
vp:VP > adv:A v0:V0 v0:V1 v0:V2 s:S pp:P
@ [VP = V0,
V0:xcomp = V1, V0:subj = V1:subj,
V1:xcomp = V2, V1:subj = V2:subj,
V2:comp = S].
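
The three principles on this slide can be sketched in Python as regular expressions over the daughter sequence of a rule. This is only an illustration under assumptions: the function `annotate`, the `cat_position` node names and the equation strings are hypothetical, and this is not the notation or implementation of Sadler, van Genabith and Way.

```python
import re

def annotate(lhs, rhs):
    """Return f-structure equations for one CFG rule (hypothetical notation)."""
    # Name each daughter by category and position, e.g. "v0_1".
    s = " ".join(f"{cat}_{i}" for i, cat in enumerate(rhs))
    eqs = []
    # Principle: the first v0 daughter heads the vp.
    m = re.search(r"\bv0_\d+", s)
    if lhs == "vp" and m:
        eqs.append(f"{lhs} = {m.group()}")
    # Principle: of two adjacent v0 daughters, the left one takes the right
    # one as xcomp and shares its subject. The lookahead leaves the right
    # daughter unconsumed so that chains of v0 are all matched.
    for m in re.finditer(r"\b(v0_\d+) (?=(v0_\d+))", s):
        eqs.append(f"{m.group(1)}:xcomp = {m.group(2)}")
        eqs.append(f"{m.group(1)}:subj = {m.group(2)}:subj")
    # Principle: a v0 immediately followed by s takes it as comp.
    for m in re.finditer(r"\b(v0_\d+) (s_\d+)", s):
        eqs.append(f"{m.group(1)}:comp = {m.group(2)}")
    return eqs

equations = annotate("vp", ["adv", "v0", "v0", "v0", "s", "pp"])
```

For the example rule, the three principles together yield the head equation, two xcomp/subj pairs and one comp equation, which mirrors the combined annotation shown on the slide.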

7
Regular expression based annotation [Sadler, van Genabith, Way, 2000]

Factors out generalisations
Principle-based LFG architecture [Bresnan, 2000]
Recall/precision in the low to mid 90s (%)
But
So far applied only to treebank (AP) fragments on the order of 100 trees and 500 CFG rules
Penn-II: 17K rule types, 50,000 trees

8
Rewriting of flat tree descriptions [Frank, 2000]

Represent tree as flat set of terms in tree description language
        S:A
       /   \
   NP:B     VP:C
    |       /  \
    HP   sold   NP:D
                 |
               Compaq
dom(A,B), dom(A,C), dom(C,D),
cat(A,S), cat(B,NP), cat(C,VP), cat(D,NP),
prec(B,C), ...
Formulate annotation principles as set rewriting rules [Kay, 2000]
dom(X,Y), dom(X,Z), prec(Y,Z), cat(X,S), cat(Y,NP), cat(Z,VP)
==>
phi(X,FX), phi(Y,FY), phi(Z,FZ), subj(FX,FY), eq(FX,FZ)
And apply ...
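
A minimal Python sketch of applying the subject rule to the flat description of "HP sold Compaq". Assumptions: terms are encoded as tuples, f-structure variables are named by prefixing "F" to the node name, and the rule only adds terms without consuming its left-hand side. The actual rewriting system may behave differently on all three points.

```python
from itertools import permutations

# Flat description of the tree as a set of terms.
TREE = {
    ("dom", "A", "B"), ("dom", "A", "C"), ("dom", "C", "D"),
    ("cat", "A", "S"), ("cat", "B", "NP"), ("cat", "C", "VP"), ("cat", "D", "NP"),
    ("prec", "B", "C"),
}

def apply_subject_rule(terms):
    """Add f-structure terms wherever an S dominates an NP that precedes a VP."""
    nodes = sorted({t[1] for t in terms if t[0] == "cat"})
    out = set(terms)
    # Try every binding of the rule variables X, Y, Z to distinct nodes.
    for x, y, z in permutations(nodes, 3):
        lhs = {("dom", x, y), ("dom", x, z), ("prec", y, z),
               ("cat", x, "S"), ("cat", y, "NP"), ("cat", z, "VP")}
        if lhs <= terms:  # all left-hand-side terms are present
            fx, fy, fz = "F" + x, "F" + y, "F" + z
            out |= {("phi", x, fx), ("phi", y, fy), ("phi", z, fz),
                    ("subj", fx, fy), ("eq", fx, fz)}
    return out

annotated = apply_subject_rule(TREE)
```

Only the binding X=A, Y=B, Z=C matches, so the NP "HP" becomes the subject of the sentence and the VP shares the f-structure of S.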

9
Rewriting of flat tree descriptions [Frank, 2000]

Factors out generalisations
Principle-based LFG architecture [Bresnan, 2000]
Can look at arbitrary tree fragments
Works well
But
So far applied only to treebank fragments (SUSANNE) on the order of 100 trees
Penn-II: 17K rule types, 50,000 trees

10
Tree-to-F-Structure Algorithms

Tree-to-f-structure algorithm architectures
direct tree-to-f-structure transduction [Kaplan, 1996]: ATIS corpus
indirect tree node annotation
[Lappin, Golan, Rimon, 1989]: subject, direct object, object of preposition, verb-particle and noun argument of adjective
[Cahill et al., 2002]: f-structure annotation

11
Annotation Algorithm

Recursive procedure (Java)
Scale to Penn-II treebank
Robust
Proto-f-structures
Basic predicate argument structure
Possibly partial (f-structure fragments)
Less detail (some reentrancies)

12
Annotation Algorithm

Clean design to facilitate maintenance and reuse
Linguistic basis
Algorithm
Recursive procedure on treebank trees
Three main components
Ordering important

Left/right context annotation principles
Co-ordinating configurations annotation principles
Catch-all and clean-up
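
The control flow of such a recursive procedure might look like the Python skeleton below. It is a sketch, not the authors' Java code: the dictionary-based tree, the helper `make` and the three stub components (which only record that they ran) are assumptions made for illustration. The order of the calls reflects the point that ordering matters: co-ordination or left/right context first, clean-up last.

```python
def make(label, *children):
    """Build a hypothetical tree node."""
    return {"label": label, "children": list(children), "applied": []}

# Stub components: a real system would add f-structure equations here.
def left_right(node):
    node["applied"].append("left-right")

def coordination(node):
    node["applied"].append("coordination")

def catch_all(node):
    node["applied"].append("catch-all")

def annotate(node):
    """Annotate the local subtree, then recurse into the daughters."""
    kids = node["children"]
    if not kids:
        return
    if any(k["label"] == "CC" for k in kids):
        coordination(node)   # co-ordinating configurations
    else:
        left_right(node)     # applies only if no CC is present
    catch_all(node)          # clean-up always runs last
    for k in kids:
        annotate(k)

tree = make("S",
            make("NP", make("NN"), make("CC"), make("NN")),
            make("VP", make("VBD")))
annotate(tree)
```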
13
Left/right context annotation principles

Apply iff no CC present
Positional and categorial information
Based on simple tri-partition of rule RHS
LHS → [Left Context] [Head] [Right Context]
LHS → LC HD RC
NP → DT ADJP NN NN RCL PP
Penn-II functional annotations
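
The tri-partition can be sketched as follows in Python. The head candidates and the small annotation matrix are hypothetical placeholders, far simpler than real head rules, and the annotations are written in ASCII ("up" for the mother's f-structure, "down" for the daughter's).

```python
# Hypothetical head candidates per LHS category, searched right to left.
HEAD_CANDIDATES = {"NP": {"NN", "NNS", "NNP"}, "VP": {"VBD", "VBZ", "VB", "MD"}}

# Hypothetical matrix entries: (LHS, side, category) -> annotation.
MATRIX = {
    ("NP", "L", "DT"): "up:spec:det = down",
    ("NP", "L", "ADJP"): "down in up:adjunct",
    ("NP", "L", "NN"): "down in up:adjunct",
    ("NP", "R", "RCL"): "down in up:relmod",
    ("NP", "R", "PP"): "down in up:adjunct",
}

def tripartition(lhs, rhs):
    """Split a rule RHS into (left context, head, right context)."""
    candidates = HEAD_CANDIDATES.get(lhs, set())
    for i in range(len(rhs) - 1, -1, -1):
        if rhs[i] in candidates:
            return rhs[:i], rhs[i], rhs[i + 1:]
    # Fallback assumption: take the rightmost daughter as head.
    return rhs[:-1], rhs[-1], []

def annotate_rule(lhs, rhs):
    """Annotate each daughter from the matrix; unknown daughters get None."""
    left, head, right = tripartition(lhs, rhs)
    out = [(c, MATRIX.get((lhs, "L", c))) for c in left]
    out.append((head, "up = down"))
    out += [(c, MATRIX.get((lhs, "R", c))) for c in right]
    return out

parts = tripartition("NP", ["DT", "ADJP", "NN", "NN", "RCL", "PP"])
```

Daughters left as `None` here are the ones a later clean-up step would have to handle.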

14
Left/right context annotation principles

Head: [Magerman, 1995], lexicalised PCFGs
LC - H - RC
Complements
Adjuncts

15
Left/right context annotation principles

Example Matrix
NP → DET NN
NP → ADJP NNS RELCL
NP → NN NN PP
lots more

16
Left/right context annotation principles
17
Left/right context annotation principles

Build an annotation matrix for each of the monadic categories in Penn-II
Based on analysing the most frequent rule types for each category, such that the sum of the token frequencies of these rule types is greater than 85% of the total number of rule tokens for that category
Category   Rule types   Rule types analysed
NP         6595         102
VP         10239        307
S          2602         20
ADVP       234          6
Apply the annotation matrix to all rules/sub-trees, i.e. also to NP-LOC, NP-TMP etc.
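
The 85% selection criterion can be illustrated with a short Python sketch. The function name `rules_to_analyse` and the toy rule strings are made up for the example; only the cut-off logic follows the slide.

```python
from collections import Counter

def rules_to_analyse(rule_tokens, coverage=0.85):
    """Most frequent rule types whose token counts together exceed `coverage`."""
    counts = Counter(rule_tokens)
    total = sum(counts.values())
    chosen, covered = [], 0
    for rule, n in counts.most_common():
        if covered > coverage * total:
            break  # the rule types chosen so far already exceed the threshold
        chosen.append(rule)
        covered += n
    return chosen

# Toy data: 10 NP rule tokens spread over 3 rule types.
tokens = ["NP -> DT NN"] * 6 + ["NP -> NN"] * 3 + ["NP -> DT JJ NN"]
selected = rules_to_analyse(tokens)
```

In the toy data the two most frequent types cover 9 of 10 tokens, so the third type is not analysed by hand.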

18
Co-ordinating configurations annotation principles

These vary a lot ..
For all rules with length(RHS) = 3 and RHS = X CC Y
CC: ↑=↓
X, Y: ↓∈↑CONJ
etc.
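
The simplest co-ordination case, a three-daughter rule of the form X CC Y, could be handled as in the Python sketch below. It assumes the conjunction is annotated as head and both conjuncts as members of a conjunct set, written in ASCII as "up = down" and "down in up:conj". The function name is hypothetical and the many other co-ordination patterns are not covered.

```python
def annotate_coordination(rhs):
    """Annotate an X CC Y rule; return None if the pattern does not apply."""
    if len(rhs) != 3 or rhs[1] != "CC":
        return None
    x, cc, y = rhs
    return [(x, "down in up:conj"), (cc, "up = down"), (y, "down in up:conj")]

np_coord = annotate_coordination(["NP", "CC", "NP"])
```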

19
Catch-all and Clean-up

Use functional annotations in Penn-II
Everything with X-TMP, X-LOC, ... gets ↓∈↑ADJ
Every PP unannotated ......
Two NPs under VP
NP1: ↑OBJ=↓, NP2: ↑OBJ2=↓
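
A rough Python sketch of two of these clean-up steps. Assumptions: daughters are given as Penn-II labels, `None` marks a daughter left unannotated by earlier components, functional tags only fill such gaps, and the two-NP case overwrites earlier annotations. The handling of unannotated PPs is left out because the slide does not spell it out.

```python
def catch_all(parent, daughters, annotations):
    """Return a cleaned-up copy of `annotations` (one entry per daughter)."""
    out = list(annotations)
    # Penn-II functional tags such as -TMP and -LOC mark adjuncts.
    for i, label in enumerate(daughters):
        if out[i] is None and label.endswith(("-TMP", "-LOC")):
            out[i] = "down in up:adj"
    # Two NPs under VP: the first is OBJ, the second OBJ2.
    if parent == "VP":
        nps = [i for i, label in enumerate(daughters) if label == "NP"]
        if len(nps) == 2:
            out[nps[0]] = "up:obj = down"
            out[nps[1]] = "up:obj2 = down"
    return out

cleaned = catch_all("VP", ["VBD", "NP", "NP", "PP-TMP"],
                    ["up = down", None, None, None])
```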

20
Algorithm Example Input
21
Algorithm Example Output
22
Algorithm Example f-structures

Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29
subj    adjunct   1   num sing
                      pers 3
                      pred Pierre
                  2   adjunct   3   adjunct   4   pred 61
                                    pers 3
                                    pred years
                                    num pl
                      pred old
        num sing
        pers 3
        pred Vinken
xcomp   subj    adjunct   1   num sing
                              pers 3
                              pred Pierre
                          2   adjunct   3   adjunct   4   pred 61
                                            pers 3
                                            pred years
23
Algorithm Example f-structures

Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29
(F-structure continued)
xcomp   subj    num sing
                pers 3
                pred Vinken
        obj     spec    det   pred the
                num sing
                pers 3
                pred board
        obl     obj     spec      det   pred a
                        adjunct   5   pred nonexecutive
                        pred director
                        num sing
                        pers 3
                pred as
        pred join
        adjunct   6   pred Nov.
                      num sing

24
Evaluation

Qualitative (with gold standard)
randomly select 105 trees from section 23
hand-annotate with f-structure information (gold standard)
evaluate automatically annotated trees against the gold standard with evalb
Quantitative (without gold standard)
fragmentation
number of RHS constituents that receive an annotation etc.
number of trees that do not receive an f-structure
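
One way to picture the fragmentation measure is to count how many disconnected f-structure fragments a tree yields. The Python sketch below assumes a simplified model in which any annotated daughter is linked to its mother's f-structure and every unannotated daughter starts a fragment of its own. The data layout is hypothetical, and the count is never 0, so this does not model trees that receive no f-structure at all.

```python
from collections import Counter

def count_fragments(parent_of, annotated):
    """Count f-structure fragments in one tree.

    parent_of: dict mapping each node to its parent (the root maps to None).
    annotated: set of nodes that carry an annotation linking them to the parent.
    """
    # In a tree, each node that is the root or lacks a link to its parent
    # opens exactly one new connected fragment.
    return sum(1 for node, parent in parent_of.items()
               if parent is None or node not in annotated)

tree = {"S": None, "NP": "S", "VP": "S", "PP": "VP"}
complete = count_fragments(tree, {"NP", "VP", "PP"})   # one f-structure
partial = count_fragments(tree, {"NP", "VP"})          # PP is left dangling
distribution = Counter([complete, partial])            # fragments -> number of trees
```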

25
Evaluation Qualitative ("gold standard")

105 sentences from section 23 hand-annotated, 99 of them in the length-restricted (len) subset
                       All (105)   len (99)
Bracketing Recall      86.16       86.19
Bracketing Precision   86.16       86.19
Tagging Accuracy       92.06       91.82
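
For reference, labelled bracketing precision and recall can be computed as in the Python sketch below. This is not evalb itself. It assumes each bracket is a (label, start, end) triple and that the f-structure annotation is treated as part of the label, which is an assumption about the evaluation setup.

```python
from collections import Counter

def precision_recall(gold, test):
    """Labelled bracketing precision and recall over multisets of brackets."""
    g, t = Counter(gold), Counter(test)
    matched = sum((g & t).values())  # multiset intersection
    return matched / sum(t.values()), matched / sum(g.values())

def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

gold = [("S[up=down]", 0, 3), ("NP[subj]", 0, 1), ("VP[up=down]", 1, 3), ("NP[obj]", 2, 3)]
test = [("S[up=down]", 0, 3), ("NP[subj]", 0, 1), ("VP[up=down]", 1, 3), ("NP[adj]", 2, 3)]
p, r = precision_recall(gold, test)
```

When the same trees are annotated on both sides, the two bracket counts are equal, so precision and recall coincide; that would be consistent with the identical precision and recall figures reported here.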

26
Evaluation Quantitative

Results
RHS annotations (excluding punctuation)
LHS        RHS elements   RHS annotated   % annotated
ADJP       1653           1633            99.44
ADJP-ADV   21             21              100.0
ADJP-CLR   27             27              100.0
ADVP       607            606             99.84
NP         30793          29770           96.68
PP         1090           1089            99.92
S          14912          14849           99.58
SBAR       423            422             99.88
SBARQ      270            269             99.61
SQ         657            548             83.43
VP         40990          40800           99.78

27
Evaluation Quantitative

Results
Trees with f-structures
F-structures   Trees    % of trees
0              1007     2.048
1              43292    88.051
2              3555     7.23
3              639      1.299
4              445      0.905
5              169      0.344
6              43       0.087
7              11       0.022
8              5        0.010
9              1        0.002

28
Parsing

Parsing and automatic annotation
Two architectures
PCFG automatic annotation pipeline
text → PCFG → LFG → annotated trees
LFG-PCFG integrated model
text → PCFG+LFG → annotated trees
12 experiments

29
Parsing

3 baseline PCFGs
PCFG 18K rules
PCFGF 29K rules
PCFGP 27K rules [Johnson, 1998]
Derived from sections 02-21
Standard preprocessing
delete empty productions
collapse unary structure
CYK parser, Chomsky normal form
Evaluation
2.4K trees from section 23: precision and recall (P&R), F-score
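
Deriving a baseline PCFG from treebank trees amounts to reading off the rules and estimating probabilities by relative frequency. The Python sketch below assumes trees encoded as nested tuples and skips lexical rules; it does not perform the preprocessing steps listed above (deleting empty productions, collapsing unary structure), and the two toy trees are invented.

```python
from collections import Counter

def rules(tree):
    """Yield (lhs, rhs) for every phrasal node of a nested-tuple tree."""
    label, children = tree[0], tree[1:]
    if all(isinstance(c, str) for c in children):
        return  # preterminal (tag, word): lexical rules are skipped here
    yield label, tuple(c[0] for c in children)
    for c in children:
        yield from rules(c)

def estimate_pcfg(trees):
    """Relative frequency estimate: P(lhs -> rhs) = count(rule) / count(lhs)."""
    rule_counts = Counter(r for t in trees for r in rules(t))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {r: n / lhs_counts[r[0]] for r, n in rule_counts.items()}

TREES = [
    ("S", ("NP", ("NNP", "HP")),
          ("VP", ("VBD", "sold"), ("NP", ("NNP", "Compaq")))),
    ("S", ("NP", ("DT", "the"), ("NN", "board")),
          ("VP", ("VBD", "met"))),
]
grammar = estimate_pcfg(TREES)
```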

30
Parsing

3 PCFGs + LFG annotation (integrated)
PCFGLFG 31K rules
PCFGFLFG 40K rules
PCFGPLFG 37K rules
Evaluation
2.4K trees from section 23: P&R, F-score
f-structures generated for section 23, fragmentation (cheating!)
P&R for 105 manually annotated gold standard trees in section 23 (cheating!)

31
Parsing

3 PCFGs → LFG annotation (pipeline)
PCFG → LFG 18K rules
PCFGF → LFG 29K rules
PCFGP → LFG 27K rules
Evaluation
2.4K trees from section 23: P&R, F-score
f-structures generated for section 23, fragmentation (cheating!)
P&R for 105 manually annotated gold standard trees in section 23 (cheating!)

32
Parsing

6 PCFGs + LFG (compacted, integrated and pipeline)
thresholding [Krotov, Gaizauskas, Hepple, Wilks, 1998]: only rules used at least 5 times
Integrated                Pipeline
PCFGLFGC    4.8K rules    PCFGC → LFG     3K rules
PCFGFLFGC   5.5K rules    PCFGFC → LFG    4.3K rules
PCFGPLFGC   5.4K rules    PCFGPC → LFG    4.4K rules
Evaluation
2.4K trees from section 23: P&R, F-score
f-structures generated for section 23, fragmentation (cheating!)
P&R for 105 manually annotated gold standard trees in section 23 (cheating!)
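
Compaction by thresholding can be sketched in Python as a filter over rule counts. The threshold of 5 comes from the slide. Renormalising the remaining probabilities per left-hand side is an assumption of this sketch, as are the function name and the toy counts.

```python
from collections import defaultdict

def compact(rule_counts, min_count=5):
    """Keep rules seen at least `min_count` times and renormalise per LHS."""
    kept = {r: n for r, n in rule_counts.items() if n >= min_count}
    totals = defaultdict(int)
    for (lhs, _), n in kept.items():
        totals[lhs] += n
    return {r: n / totals[r[0]] for r, n in kept.items()}

counts = {
    ("NP", ("DT", "NN")): 15,
    ("NP", ("NN",)): 5,
    ("NP", ("DT", "JJ", "JJ", "NN")): 2,   # below the threshold, dropped
}
compacted = compact(counts)
```

The trade-off shown in the rule counts above is the usual one: far fewer rules to parse with, at the risk of losing coverage for rare constructions.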

33
Parsing Full Grammars
34
Parsing Compacted Grammars
35
Number of Rules
36
Time Taken to Parse
37
Number of parses found
38
Labelled F-Score
39
Unlabelled F-Score
40
receiving 1 F-Structure
41
F-Score of 105 gold standard
42
Conclusion

Design and implementation of an automatic f-structure annotation algorithm for Penn-II
Evaluation of f-structures generated
precision and recall against gold standard
fragmentation/partiality/nodes without annotation
Parsing and automatic annotation
PCFG automatic annotation pipeline
LFG-PCFG integrated model

43
Conclusion