Parsing the NEGRA corpus

Learn more at: https://nlp.stanford.edu

1
Parsing the NEGRA corpus
  • Greg Donaker
  • June 14, 2006

2
NEGRA Corpus
  • German language tagged corpus
  • 20,602 sentences (355,096 tokens)
  • Significantly smaller than Penn Treebank
  • Can be used similarly to Penn Treebank
  • Similar annotations, much flatter trees (Dubey &
    Keller 2003)

3
Baseline error analysis
  • Ran through the Stanford Parser using
    NEGRA-specific parameters
  • 91.75% tagging accuracy
  • PCFG f-score 66.42
  • Most frequently underproposed rule: NP -> ART NN
    (98 times)
  • Most frequently underproposed category: NN (498
    times, three times the next category)
  • These error counts seem abnormally high given the
    structure of the German language.
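
The slides do not define "underproposed". A minimal sketch, assuming it means a rule that occurs in the gold tree but is missing from the parser's output; the tuple-based tree encoding and the function names `rules` and `underproposed` are hypothetical and not taken from the Stanford Parser:

```python
from collections import Counter

def rules(tree):
    """Yield one 'LHS -> RHS' string per phrasal node.

    Assumed encoding: (label, child, ...) for phrases, (tag, word) for leaves.
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return  # preterminal (tag, word): contributes no phrase-structure rule
    yield f"{label} -> {' '.join(child[0] for child in children)}"
    for child in children:
        yield from rules(child)

def underproposed(gold, guess):
    """Count rules that appear in the gold tree more often than in the guess."""
    # Counter subtraction keeps only positive differences.
    return Counter(rules(gold)) - Counter(rules(guess))

# Toy example: the guessed parse fails to group "der Hund" into an NP.
gold = ("S", ("NP", ("ART", "der"), ("NN", "Hund")), ("VVFIN", "bellt"))
guess = ("S", ("ART", "der"), ("NN", "Hund"), ("VVFIN", "bellt"))
missing = underproposed(gold, guess)
print(missing["NP -> ART NN"])  # 1
```

Summing these counters over a whole test set would give per-rule totals comparable in spirit to the "98 times" figure above.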

4
Approach
  • Bug: tag distribution of unknown words was
    modeled as the baseline distribution
  • Reworked the unknown-word model to fit the
    specifics of the German language
  • Models based on first letter, capitalization of
    first letter, and ending substring of words
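
One way to picture such a model is a word "signature" that buckets unseen words by surface features, with a tag distribution estimated per bucket. The sketch below is illustrative only: it uses capitalization and an ending substring, applies no smoothing, and the names `signature`, `train_unknown_model`, and `tag_distribution` are hypothetical rather than the presenter's code or the Stanford Parser API:

```python
from collections import Counter, defaultdict

def signature(word, suffix_len=2):
    """Bucket a word by first-letter capitalization and its last characters."""
    cap = "CAP" if word[:1].isupper() else "low"
    return f"{cap}|{word[-suffix_len:].lower()}"

def train_unknown_model(tagged_words, suffix_len=2):
    """Count tags per signature (assumption: trained on rare training words)."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[signature(word, suffix_len)][tag] += 1
    return counts

def tag_distribution(model, word, suffix_len=2):
    """Relative tag frequencies for the word's signature; {} if never seen."""
    tag_counts = model.get(signature(word, suffix_len))
    if not tag_counts:
        return {}
    total = sum(tag_counts.values())
    return {tag: n / total for tag, n in tag_counts.items()}

# Toy training data with STTS-style tags.
training = [("Hund", "NN"), ("Mund", "NN"), ("rund", "ADJD"),
            ("Zeitung", "NN"), ("laufen", "VVINF")]
model = train_unknown_model(training)
print(tag_distribution(model, "Grund"))  # {'NN': 1.0}
```

Because German capitalizes all nouns, the capitalization bit alone separates "Grund" from "rund" here even though both end in "nd".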

5
Results
  • Best-performing model (on both test and
    validation sets) matched intuition:
    capitalization of first letter plus last two
    characters of word
  • Improves tagging accuracy from 91.75% to 94.49%
  • Improves PCFG F-score from 66.42 to 69.87
  • Reduces underproposed NP -> ART NN from 98 to 48
  • Reduces underproposed NN from 498 to 73
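
To put these figures on a common scale, the short calculation below expresses each as a relative reduction. It assumes the tagging accuracies are percentages, so the tagging error rate is 100 minus the accuracy; `relative_reduction` is a helper introduced here for illustration:

```python
def relative_reduction(before, after):
    """Fraction of the original quantity that was eliminated."""
    return (before - after) / before

# Tagging error rate, assuming accuracies are percentages.
tag_error = relative_reduction(100 - 91.75, 100 - 94.49)
np_rule = relative_reduction(98, 48)   # underproposed NP -> ART NN
nn_cat = relative_reduction(498, 73)   # underproposed NN

print(f"tagging error: {tag_error:.1%}")        # tagging error: 33.2%
print(f"NP -> ART NN:  {np_rule:.1%}")          # NP -> ART NN:  51.0%
print(f"NN:            {nn_cat:.1%}")           # NN:            85.3%
print(f"F-score gain:  {69.87 - 66.42:.2f}")    # F-score gain:  3.45
```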