Parsing with PCFGs and Automatic F-Structure Annotation

1
Parsing with PCFGs and Automatic F-Structure Annotation
  • Aoife Cahill, Mairéad McCarthy, Josef van
    Genabith, Andy Way
  • Computer Applications
  • National Centre for Language Technology
  • Dublin City University
  • Dublin, Ireland
  • {acahill, mccarthy, josef, away}@computing.dcu.ie

2
Background
  • Treebanks (Penn-II, SUSANNE, ...) are very useful
    resources
  • But just trees ...
  • Want to know who did what to whom?
  • LFG f-structures / logical forms

3
Background

4
Overview
  • How do we get Penn-II with f-structure
    information?
  • Manual
  • Automatic
  • regular expressions
  • set rewriting
  • algorithm
  • What do we do with it?
  • Parsing
  • text → PCFG → AA → f-structure
  • text → PCFG → AA → f-structure
  • Conclusion

5
Automatic Annotation
  • Based on configurational and categorial
    information (trees or CFG rules)
  • Three possible architectures
  • regular expression based [Sadler, van Genabith,
    Way, 2000]
  • rewriting of flat tree descriptions in tree
    representation logic [Frank, 2000; Liakata and
    Pulman, 2002]
  • algorithm [Kaplan, 1996 (ATIS, LFG-DOP); Lappin,
    Golan, Rimon, 1989; Cahill et al., 2002]

6
Regular expression based annotation [Sadler, van
Genabith, Way, 2000]
  • Extract CFG rules from treebank [Charniak, 1996]
  • vp:VP > adv:A v0:V0 v0:V1 v0:V2 s:S pp:P
  • Formulate f-str. annotation principles
  • vp:VP > v0:V1 v0:V2
  • @ V1:xcomp = V2, V1:subj = V2:subj.
  • vp:VP > (v0) v0:V0
  • @ VP = V0.
  • vp:VP > v0:V0 s:S
  • @ V0:comp = S.
  • Apply annotation principles
  • vp:VP > adv:A v0:V0 v0:V1 v0:V2 s:S pp:P
  • @ VP = V0,
  • V0:xcomp = V1, V0:subj = V1:subj,
  • V1:xcomp = V2, V1:subj = V2:subj,
  • V2:comp = S.

7
Regular expression based annotation [Sadler, van
Genabith, Way, 2000]
  • Factors out generalisations
  • Principle-based LFG architecture [Bresnan, 2000]
  • Recall/precision in the low to mid 90s
  • But
  • So far applied only to treebank (AP) fragments on
    the order of 500 CFG rules
  • 100 trees ...
  • Penn-II has 17K rule types and 50,000 trees

8
Rewriting of flat tree descriptions [Frank, 2000]
  • Represent tree as flat set of terms in tree
    description language
  • Example tree for "HP sold Compaq" (nodes written
    category:variable)
            S:A
           /   \
        NP:B    VP:C
         |     /    \
        HP   sold    NP:D
                      |
                    Compaq
  • dom(A,B), dom(A,C), dom(C,D),
    cat(A,S), cat(B,NP), cat(C,VP), cat(D,NP),
    prec(B,C), ...
  • Formulate annotation principles as set rewriting
    rules [Kay, 2000]
  • dom(X,Y), dom(X,Z), prec(Y,Z), cat(X,S),
    cat(Y,NP), cat(Z,VP)
  • ⇒ phi(X,FX), phi(Y,FY), phi(Z,FZ),
    subj(FX,FY), eq(FX,FZ)
  • And apply ...
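
The rewriting step on this slide can be sketched as follows. This is a minimal illustration in Python, not Frank's system: the description is assumed to be a set of tuples, the single rule is hard-coded in a hypothetical function `apply_subject_rule`, f-structure variables are named by prefixing `F` to the node name, and the rule only adds terms (a real rewriting system may also consume them).

```python
from itertools import product

# Flat description of the tree for "HP sold Compaq" (terms taken from the slide).
description = {
    ("dom", "A", "B"), ("dom", "A", "C"), ("dom", "C", "D"),
    ("cat", "A", "S"), ("cat", "B", "NP"), ("cat", "C", "VP"), ("cat", "D", "NP"),
    ("prec", "B", "C"),
}


def apply_subject_rule(terms):
    """Add phi/subj/eq terms wherever an S node dominates an NP preceding a VP."""
    nodes = {t[1] for t in terms if t[0] == "cat"}
    new_terms = set()
    for x, y, z in product(nodes, repeat=3):
        left_hand_side = {
            ("dom", x, y), ("dom", x, z), ("prec", y, z),
            ("cat", x, "S"), ("cat", y, "NP"), ("cat", z, "VP"),
        }
        if left_hand_side <= terms:  # every LHS term is present in the description
            fx, fy, fz = "F" + x, "F" + y, "F" + z  # hypothetical naming scheme
            new_terms |= {
                ("phi", x, fx), ("phi", y, fy), ("phi", z, fz),
                ("subj", fx, fy), ("eq", fx, fz),
            }
    return terms | new_terms


annotated = apply_subject_rule(description)
```

In this toy run the only match binds X=A, Y=B, Z=C, so the NP "HP" becomes the subject of the S f-structure and the VP shares the f-structure of S.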

9
Rewriting of flat tree descriptions [Frank, 2000]
  • Factors out generalisations
  • Principle-based LFG architecture [Bresnan, 2000]
  • Can look at arbitrary tree fragments
  • Works well
  • But
  • So far applied only to treebank fragments (SUSANNE)
    on the order of 100 trees ...
  • Penn-II has 17K rule types and 50,000 trees

10
Tree-to-F-Structure Algorithms
  • Tree-to-f-structure algorithm architectures
  • direct tree-to-f-structure transduction
  • [Kaplan, 1996]: ATIS corpus
  • indirect tree node annotation
  • [Lappin, Golan, Rimon, 1989]: subject, direct
    object, object of preposition, verb-particle and
    noun argument of adjective
  • [Cahill et al., 2002]: f-structure annotation

11
Annotation Algorithm
  • Recursive procedure (Java)
  • Scale to Penn-II treebank
  • Robust
  • Proto-f-structures
  • Basic predicate argument structure
  • Possibly partial (f-structure fragments)
  • Less detail (some reentrancies)

12
Annotation Algorithm
  • Clean design to facilitate maintenance and reuse
  • Linguistic basis
  • Algorithm
  • Recursive procedure on treebank trees
  • Three main components
  • Ordering important
  • 1. Left/right context annotation principles
  • 2. Co-ordinating configurations annotation principles
  • 3. Catch-all and clean-up

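
A rough sketch of how a recursive procedure with three ordered components might be organised. The slides say the original is written in Java; Python is used here only for brevity. The `Node` class, the three component functions and the annotation strings (`up=down` standing for ↑=↓, and so on) are all illustrative assumptions, and the principles inside each component are toy stand-ins for the real ones.

```python
class Node:
    """A treebank tree node; `ann` holds its functional annotation, if any."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.ann = None


def has_cc(node):
    return any(child.label == "CC" for child in node.children)


def left_right_context(node):
    # Toy principle: applies only when no CC is present.
    if has_cc(node) or node.label != "S":
        return
    for child in node.children:
        if child.ann is None and child.label == "NP":
            child.ann = "up-subj=down"
        elif child.ann is None and child.label == "VP":
            child.ann = "up=down"


def coordination(node):
    # Toy principle: CC is the head, the other daughters join the CONJ set.
    if not has_cc(node):
        return
    for child in node.children:
        if child.ann is None:
            child.ann = "up=down" if child.label == "CC" else "down in up-conj"


def catch_all(node):
    # Toy clean-up: Penn-II -TMP/-LOC functional tags mark adjuncts.
    for child in node.children:
        if child.ann is None and child.label.endswith(("-TMP", "-LOC")):
            child.ann = "down in up-adjunct"


# The order matters: later components only fill in what earlier ones left open.
COMPONENTS = (left_right_context, coordination, catch_all)


def annotate(node):
    """Annotate the daughters of `node`, then recurse into each daughter."""
    for component in COMPONENTS:
        component(node)
    for child in node.children:
        annotate(child)
    return node


clause = annotate(Node("S", [Node("NP-TMP"), Node("NP"), Node("VP")]))
coord = annotate(Node("NP", [Node("NP"), Node("CC"), Node("NP")]))
```

Each component skips daughters that already carry an annotation, which is one simple way to make the ordering significant.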
13
Left/right context annotation principles
  • Apply iff no CC present
  • Positional and categorial information
  • Based on simple tri-partition of rule RHS
  • LHS → Left Context  Head  Right Context
  • LHS → LC HD RC
  • NP → DT ADJP NN NN RCL PP
  • Penn-II functional annotations

14
Left/right context annotation principles
  • Head: found with head rules as in [Magerman, 1995]
    (lexicalised PCFGs)
  • LC - H - RC
  • Complements
  • Adjuncts
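
The tri-partition of a rule's right-hand side into left context, head and right context could look roughly like this. The head table `HEAD_RULES` is a hypothetical, much reduced stand-in for Magerman-style head rules; the categories and scan directions in it are assumptions chosen for the example, not the values used by the authors.

```python
# Hypothetical head-preference table: scan direction plus candidate head
# categories in priority order, per left-hand-side category.
HEAD_RULES = {
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),
    "VP": ("left", ["VBD", "VBZ", "VB", "VP"]),
    "S": ("left", ["VP", "S"]),
}


def partition(lhs, rhs):
    """Split the RHS of a rule into (left context, head, right context)."""
    direction, candidates = HEAD_RULES.get(lhs, ("left", []))
    if direction == "left":
        positions = range(len(rhs))
    else:
        positions = range(len(rhs) - 1, -1, -1)
    for category in candidates:
        for i in positions:
            if rhs[i] == category:
                return rhs[:i], rhs[i], rhs[i + 1:]
    # Fallback: take the first daughter in scan order as head.
    i = positions[0]
    return rhs[:i], rhs[i], rhs[i + 1:]


left, head, right = partition("NP", ["DT", "ADJP", "NN", "NN", "RCL", "PP"])
```

For the NP rule from the slide, scanning from the right picks the second NN as head, leaving `DT ADJP NN` as left context and `RCL PP` as right context. Complements and adjuncts would then be assigned by looking up each context category in the annotation matrix.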

15
Left/right context annotation principles
  • Example Matrix
  • NP → DET NN
  • NP → ADJP NNS RELCL
  • NP → NN NN PP
  • ... and many more

16
Left/right context annotation principles

17
Left/right context annotation principles
  • Build an annotation matrix for each of the monadic
    categories in Penn-II
  • Based on analysing the most frequent rule types
    for each category such that
  • the sum of the token frequencies of these rule
    types is greater than 85% of the total number of
    rule tokens for that category
  • Category   Rule types   Rule types analysed
  • NP         6595         102
  • VP         10239        307
  • S          2602         20
  • ADVP       234          6
  • Apply annotation matrix to all rules/sub-trees,
    i.e. also those NP-LOC, NP-TMP etc.
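
The selection criterion above (take the most frequent rule types until they cover more than 85% of the rule tokens of a category) can be sketched like this. The function name `most_frequent_covering` and the toy counts are made up for illustration; they are not Penn-II figures.

```python
from collections import Counter


def most_frequent_covering(rule_counts, threshold=0.85):
    """Return the most frequent rule types whose tokens exceed `threshold` of the total."""
    total = sum(rule_counts.values())
    selected, covered = [], 0
    for rule, count in rule_counts.most_common():
        selected.append(rule)
        covered += count
        if covered > threshold * total:
            break
    return selected


# Toy token counts for NP rule types (invented numbers).
np_rules = Counter({
    "NP -> DT NN": 50,
    "NP -> NN": 30,
    "NP -> NP PP": 10,
    "NP -> DT JJ NN": 6,
    "NP -> NN NN PP": 4,
})
to_analyse = most_frequent_covering(np_rules)
```

With these toy counts, three of the five rule types already cover 90 of 100 tokens, so only those three would need to be analysed by hand when building the matrix.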

18
Co-ordinating configurations annotation principles
  • These vary a lot ...
  • For all rules with length(RHS) = 3 and
  • RHS = X CC Y
  • CC: ↑=↓
  • X, Y: ↓∈↑CONJ
  • etc.

19
Catch-all and Clean-up
  • Use functional annotations in Penn-II
  • Everything with X-TMP, X-LOC, ... gets X: ↓∈↑ADJ
  • Every PP unannotated ......
  • Two NPs under VP
  • NP1: ↑OBJ=↓, NP2: ↑OBJ2=↓

20
Algorithm Example Input

21
Algorithm Example Output

22
Algorithm Example f-structures
  • Pierre Vinken , 61 years old , will join the
    board as a nonexecutive director Nov. 29
  • F-structure (nesting shown by indentation)
      subj   adjunct  1  num  sing
                         pers 3
                         pred Pierre
                      2  adjunct  3  adjunct  4  pred 61
                                     pers 3
                                     pred years
                                     num  pl
                         pred old
             num  sing
             pers 3
             pred Vinken
      xcomp  subj  adjunct  1  num  sing
                               pers 3
                               pred Pierre
                            2  adjunct  3  adjunct  4  pred 61
                                           pers 3
                                           pred years

23
Algorithm Example f-structures
  • Pierre Vinken , 61 years old , will join the
    board as a nonexecutive director Nov. 29
  • (F-structure continued)
      xcomp  subj  num  sing
                   pers 3
                   pred Vinken
             obj   spec  det  pred the
                   num  sing
                   pers 3
                   pred board
             obl   obj  spec  det  pred a
                        adjunct  5  pred nonexecutive
                        pred director
                        num  sing
                        pers 3
                   pred as
             pred  join
             adjunct  6  pred Nov.
                         num  sing

24
Evaluation
  • Qualitative (with gold standard)
  • randomly select 105 trees from section 23
  • hand-annotate with f-str. info (gold standard)
  • evalb automatically annotated trees against gold
    standard
  • Quantitative (without gold standard)
  • fragmentation
  • number of RHS constituents that receive
    annotation etc.
  • number of trees that do not receive f-structure

25
Evaluation Qualitative ("gold standard")
  • 105 sentences from section 23 hand-annotated,
  • 99 sentences len
  • -- All --
  • Bracketing Recall 86.16
  • Bracketing Precision 86.16
  • Tagging Accuracy 92.06
  • -- len
  • Bracketing Recall 86.19
  • Bracketing Precision 86.19
  • Tagging Accuracy 91.82
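
For reference, evalb-style bracketing precision, recall and F-score can be computed as below. This is a sketch under the assumption that each tree is reduced to a list of (label, start, end) brackets and that the f-structure annotation is folded into the node label so that a wrong annotation counts as a wrong bracket. The function name `prf` and the example brackets are illustrative.

```python
from collections import Counter


def prf(gold_brackets, test_brackets):
    """Precision, recall and F-score over multisets of labelled brackets."""
    gold, test = Counter(gold_brackets), Counter(test_brackets)
    matched = sum((gold & test).values())  # multiset intersection
    precision = matched / sum(test.values())
    recall = matched / sum(gold.values())
    f_score = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f_score


# Same tree on both sides; one annotation differs (obj vs. obj2).
gold = [("S", 0, 3), ("NP[up-subj=down]", 0, 1),
        ("VP[up=down]", 1, 3), ("NP[up-obj=down]", 2, 3)]
test = [("S", 0, 3), ("NP[up-subj=down]", 0, 1),
        ("VP[up=down]", 1, 3), ("NP[up-obj2=down]", 2, 3)]
precision, recall, f_score = prf(gold, test)
```

When both sides contain the same number of brackets, as happens when identical trees differ only in their annotations, precision and recall coincide. That would be consistent with the identical recall and precision figures on this slide, though the slide itself does not state the reason.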

26
Evaluation Quantitative
  • Results
  • RHS annotations (excluding punctuation)
  • LHS        RHS elements   RHS annotated   % annotated
  • ADJP       1653           1633            99.44
  • ADJP-ADV   21             21              100.0
  • ADJP-CLR   27             27              100.0
  • ADVP       607            606             99.84
  • NP         30793          29770           96.68
  • PP         1090           1089            99.92
  • S          14912          14849           99.58
  • SBAR       423            422             99.88
  • SBARQ      270            269             99.61
  • SQ         657            548             83.43
  • VP         40990          40800           99.78

27
Evaluation Quantitative
  • Results
  • trees with f-structures
  • F-structures   Trees   % of trees
  • 0              1007    2.048
  • 1              43292   88.051
  • 2              3555    7.23
  • 3              639     1.299
  • 4              445     0.905
  • 5              169     0.344
  • 6              43      0.087
  • 7              11      0.022
  • 8              5       0.010
  • 9              1       0.002
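
The percentages in this table follow from the tree counts. A small check, assuming the ten rows cover every tree that was annotated (variable names are illustrative):

```python
# Number of trees by number of f-structure fragments produced (from the slide).
fragment_counts = {
    0: 1007, 1: 43292, 2: 3555, 3: 639, 4: 445,
    5: 169, 6: 43, 7: 11, 8: 5, 9: 1,
}

total_trees = sum(fragment_counts.values())
percentages = {
    fragments: 100 * trees / total_trees
    for fragments, trees in fragment_counts.items()
}

# Trees that are fragmented, i.e. receive two or more separate f-structures.
fragmented = sum(trees for fragments, trees in fragment_counts.items() if fragments >= 2)
```

The counts sum to 49,167 trees, which matches the roughly 50,000 Penn-II trees mentioned earlier, and the share for exactly one f-structure rounds to the 88.051 given on the slide.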

28
Parsing
  • Parsing and automatic annotation
  • Two architectures
  • PCFG automatic annotation pipeline
  • text → PCFG → LFG → annotated trees
  • LFG-PCFG integrated model
  • text → PCFG+LFG → annotated trees
  • 12 experiments

29
Parsing
  • 3 baseline PCFGs
  • PCFG 18K rules
  • PCFGF 29K rules
  • PCFGP 27K rules [Johnson, 1998]
  • Derived from sections 02-21
  • Standard preprocessing
  • delete empty productions
  • collapse unary structure
  • CYK parser, Chomsky normal form
  • Evaluation
  • 2.4K trees from section 23: precision, recall,
    F-score
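
Deriving a PCFG from treebank trees is commonly done by reading off the rules and estimating each probability as a relative frequency. The slides do not spell out the estimation method, so the sketch below assumes that standard approach. Trees are assumed to be nested `(label, children)` tuples with a word string under each preterminal, and lexical rules are skipped to keep the example short.

```python
from collections import Counter


def rules(tree):
    """Yield (lhs, rhs) for every non-lexical local tree."""
    label, children = tree
    if isinstance(children, str):  # preterminal over a word: skip lexical rules
        return
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from rules(child)


def estimate_pcfg(trees):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    counts = Counter(rule for tree in trees for rule in rules(tree))
    lhs_totals = Counter()
    for (lhs, _rhs), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


treebank = [
    ("S", [("NP", [("NNP", "HP")]),
           ("VP", [("VBD", "sold"), ("NP", [("NNP", "Compaq")])])]),
    ("S", [("NP", [("DT", "the"), ("NN", "board")]),
           ("VP", [("VBD", "met")])]),
]
pcfg = estimate_pcfg(treebank)
```

The preprocessing steps on the slide (deleting empty productions, collapsing unary structure) would be applied to the trees before this extraction; they are not shown here.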

30
Parsing
  • 3 PCFGs + LFG Annotation (integrated)
  • PCFGLFG 31K rules
  • PCFGFLFG 40K rules
  • PCFGPLFG 37K rules
  • Evaluation
  • 2.4K trees from section 23: precision, recall,
    F-score
  • f-structures generated for section 23,
    fragmentation (cheating!)
  • precision and recall for the 105 manually
    annotated gold standard trees in section 23
    (cheating!)

31
Parsing
  • 3 PCFGs → LFG Annotation (pipeline)
  • PCFG → LFG 18K rules
  • PCFGF → LFG 29K rules
  • PCFGP → LFG 27K rules
  • Evaluation
  • 2.4K trees from section 23: precision, recall,
    F-score
  • f-structures generated for section 23,
    fragmentation (cheating!)
  • precision and recall for the 105 manually
    annotated gold standard trees in section 23
    (cheating!)

32
Parsing
  • 6 PCFGs with LFG annotation (compacted, integrated
    and pipeline)
  • thresholding [Krotov, Gaizauskas, Hepple, Wilks,
    1998]: keep only rules used at least 5 times
  • PCFGLFGC 4.8K rules; PCFGC → LFG 3K rules
  • PCFGFLFGC 5.5K rules; PCFGFC → LFG 4.3K rules
  • PCFGPLFGC 5.4K rules; PCFGPC → LFG 4.4K rules
  • Evaluation
  • 2.4K trees from section 23: precision, recall,
    F-score
  • f-structures generated for section 23,
    fragmentation (cheating!)
  • precision and recall for the 105 manually
    annotated gold standard trees in section 23
    (cheating!)
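
Compaction by thresholding amounts to dropping every rule seen fewer than five times. A minimal sketch follows; re-normalising the remaining probabilities per left-hand side is an assumption made here so the result is still a proper PCFG, and the function name `compact` and the counts are invented for the example.

```python
from collections import Counter


def compact(rule_counts, min_count=5):
    """Drop rules used fewer than `min_count` times, then re-normalise per LHS."""
    kept = {rule: n for rule, n in rule_counts.items() if n >= min_count}
    lhs_totals = Counter()
    for (lhs, _rhs), n in kept.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in kept.items()}


# Toy rule counts (invented numbers).
counts = {
    ("NP", ("DT", "NN")): 60,
    ("NP", ("NN",)): 36,
    ("NP", ("DT", "JJ", "NN", "NN", "PP")): 4,  # below threshold, removed
    ("VP", ("VBD", "NP")): 5,                   # exactly at threshold, kept
}
grammar = compact(counts)
```

Rare, long rules account for much of a treebank grammar's size, which fits the drop from tens of thousands of rules to a few thousand reported on this slide.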

33
Parsing Full Grammars

34
Parsing Compacted Grammars

35
Number of Rules

36
Time Taken to Parse

37
Number of parses found

38
Labelled F-Score

39
Unlabelled F-Score

40
% receiving 1 F-Structure

41
F-Score of 105 gold standard

42
Conclusion
  • Design and implementation of automatic
    f-structure annotation algorithm for Penn-II
  • Evaluation of f-structures generated
  • precision and recall against gold standard
  • fragmentation/partiality/nodes without annotation
  • Parsing and automatic annotation
  • PCFG automatic annotation pipeline
  • LFG-PCFG integrated model

43
Conclusion
  • Further work
  • root forms as PRED values
  • refine annotation algorithm
  • ParseEval gold standard, PARC gold standard
  • use XEROX XLE constraint solver
  • localise long distance phenomena functional
    uncertainty equations
  • exploit Penn-II annotations for extracted
    material
  • compile f-structures into Quasi Logical Forms /
    UDRSs
  • More sophisticated probability models for
    parsing
  • and, and, and ...
  • Demo