Parsing Resources from Bilingual Texts

1
Parsing Resources from Bilingual Texts
  • JHU 1998 Summer Workshop on Parsing Free Word
    Order Languages
  • August 19, 1998
  • Douglas Jones Cynthia Kuo

2
Czech Reader's Digest Corpus
  • Goal: use an aligned bilingual Czech-English
    corpus to build a monolingual Czech grammar.
  • This summer: survey and assessment
  • Further work: grammar building
  • Text: Czech Reader's Digest Corpus
  • About 1M words for each language
  • Substantially pre-processed in Prague

3
Survey and Assessment
  • Process both Czech and English texts with current
    parsers
  • Large set: 23K sentences
  • Look for (near-)isomorphic parses
  • Small set: 50 sentences
  • Inventory syntactically motivated transformations

4
Sample Data
that 's strange , he thought . feeling the glass
, he discovered he could dampen the vibration but
could n't make it stop . i'm making it better
, '' les responded .
  • to je zvláštní , pomyslel si .
  • sáhl na sklo a zjistil , že chvění může sice
    zmírnit , ale že ho nezastaví úplně .
  • " vylepšuju ho , " odpověděl les .

5
Closely Aligned Text
  • Lengths of aligned sentences

6
Parsing Both Sides of Corpus
  • Michael Collins Parser
  • English (88% accuracy)
  • trained on Penn Treebank
  • converted output to dependency trees
  • Czech (78% accuracy)
  • trained on constituents derived from Prague
    Dependency Treebank
  • yielded constituent and dependency trees
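
The conversion from constituent output to dependency trees can be illustrated with a minimal sketch. It assumes each constituent already has its head child marked; parsers normally pick the head with head rules, which are not shown here. The tree encoding, the function name `to_dependencies`, and the toy sentence are hypothetical and are not taken from the workshop code.

```python
def to_dependencies(tree):
    """Convert a head-marked constituent tree into dependency edges.

    tree is either a word (str) or a tuple (label, head_child_index, children).
    Returns (head_word, edges), where edges is a list of (head, dependent) pairs.
    Words are used as node names, so repeated words would need indices instead.
    """
    if isinstance(tree, str):
        return tree, []
    _label, head_idx, children = tree
    results = [to_dependencies(child) for child in children]
    head_word = results[head_idx][0]
    edges = []
    for i, (word, sub_edges) in enumerate(results):
        edges.extend(sub_edges)
        if i != head_idx:
            # Every non-head child hangs off the head child's lexical head.
            edges.append((head_word, word))
    return head_word, edges


# Toy example (made up for illustration): "he felt the glass"
toy = ("S", 1, [
    ("NP", 0, ["he"]),
    ("VP", 0, ["felt", ("NP", 1, ["the", "glass"])]),
])
root, edges = to_dependencies(toy)
print(root, edges)
```

The reverse direction (deriving constituents from dependency trees, as was done for the Czech training data) loses information, which is one reason the two sides can end up with different amounts of structure.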

7
Exploration of Data Space
8
Data Issues
  • Tagging
  • Tokenizing
  • Alignment
  • Different treebank encodings
  • Dependency versus constituency

9
Some tagset affinities
  • Part of speech    Czech tag    English tag
  • adjective         A-           J-
  • adverb            D-           R-
  • conjunction       J-           CC
  • determiner        --, (P-)     D-
  • noun              N-           N-
  • preposition       R-           IN
  • pronoun           P-           PRP
  • verb              V-           V-
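
One way to use these affinities is as a lookup that tests whether a Czech tag and an English tag could label corresponding words. The sketch below assumes that tags are compared by prefix and that the Czech category is the first character of the tag; the names `CZ_TO_EN` and `tags_compatible` are hypothetical.

```python
# Czech tag category (first character) -> compatible English tag prefixes,
# transcribed from the affinity table. "P" also covers the optional
# determiner reading, since Czech has no dedicated determiner tag.
CZ_TO_EN = {
    "A": ("J",),          # adjective
    "D": ("R",),          # adverb
    "J": ("CC",),         # conjunction
    "N": ("N",),          # noun
    "R": ("IN",),         # preposition
    "P": ("PRP", "D"),    # pronoun, possibly determiner
    "V": ("V",),          # verb
}


def tags_compatible(cz_tag: str, en_tag: str) -> bool:
    """Return True if the Czech and English tags fall in affine categories."""
    prefixes = CZ_TO_EN.get(cz_tag[:1], ())
    return any(en_tag.startswith(prefix) for prefix in prefixes)


print(tags_compatible("NN", "NNS"))  # noun vs. noun
print(tags_compatible("VB", "NN"))   # verb vs. noun
```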

10
Transformable
  • English Czech

11
Not (easily) transformable
  • English Czech
  • punctuation, paraphrasing, ...

12
Isomorphic Structures
  • English Czech

13
Partially Matching Trees
14
Constituency Mismatches
  • The English parses had richer constituent
    structure than the Czech parses.
  • The Collins parser was trained on constituents
    derived automatically from the Czech dependency
    trees.

15
Further Research
  • Grammar construction
  • Fully process English text (get parse structures)
  • Use lexicon (23K correspondences with
    probabilities, also string edit distance)
  • Transformations for English trees to Czech trees
    (with lexical anchors)
  • Applications for Machine Translation

16
Dependency Transformations
  • Start by finding isomorphic matches
  • Allow a transformation based on one mismatched
    position
  • Collect new set of proposed matches as possible
    transformation
  • Use greedy choice of transformations
  • (We assume that the Czech-English correspondence
    is at best about 70% reliable: 88% English × 78%
    Czech ≈ 69%.)
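
The first two steps above can be sketched as follows. This is only an illustration under simplifying assumptions: each tree is a list of head indices (with -1 for the root), and a one-to-one word alignment is already given. The function names and the toy trees are hypothetical, and the greedy selection of transformations is not shown.

```python
def mismatched_positions(en_heads, cz_heads, alignment):
    """Return Czech positions whose head differs from the head projected
    from the English tree through the word alignment.

    en_heads, cz_heads: head index for each word, -1 for the root.
    alignment: dict mapping English positions to Czech positions.
    """
    projected = {}
    for en_dep, en_head in enumerate(en_heads):
        if en_dep not in alignment:
            continue
        cz_dep = alignment[en_dep]
        if en_head == -1:
            projected[cz_dep] = -1
        elif en_head in alignment:
            projected[cz_dep] = alignment[en_head]
    # Unprojected Czech positions count as mismatches (None never equals a head).
    return [i for i, head in enumerate(cz_heads) if projected.get(i) != head]


def classify(en_heads, cz_heads, alignment):
    """Label a sentence pair by how many positions fail to match."""
    n = len(mismatched_positions(en_heads, cz_heads, alignment))
    if n == 0:
        return "isomorphic"
    if n == 1:
        return "one-mismatch candidate"
    return "no match"


identity = {0: 0, 1: 1, 2: 2}
print(classify([1, -1, 1], [1, -1, 1], identity))
print(classify([1, -1, 1], [2, -1, 1], identity))
```

Pairs in the second class are the ones from which a candidate transformation would be collected.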

17
Dependency Transformations
  • Figure: two English-Czech pairs (node labels:
    Modal, Verb, Verb)

18
Transformation Example
  • (a) Initial match (2)
  • (b) Match after transformation (5)

19
Brill and Florian's Transformation Learning Tool
  • English Czech
  • feeling the glass feeling on glass
  • feeling on glass
  • Hypothesis: learning a Czech parse, given an English
    parse, is easier than learning a Czech parse from
    scratch.

Take as much of the English parse as matches the
tags.
Apply transformation-driven learning (with
hand-annotated examples).
20
Vocabulary Correspondence
  • About 12K words with likely translations (from
    word alignment experiment)
  • total 1.000000e00
  • absurdní 9
  • 3.636364e-01 4 absurd
  • 1.818182e-01 54 possible
  • 1.818182e-01 2 preposterous
  • 9.090909e-02 23044 <EMPTY_WORD>
  • 9.090909e-02 7039 and
  • Jan Hajic's students
  • Cmejrek (1998) Master's Thesis on automatic
    extraction
  • Curin (1988) Undergraduate Thesis on automatic
    extraction
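
A lexicon entry like the one for absurdní can be read and ranked with a few lines of code. The sketch below assumes each row has the form "probability number word"; the slide does not say what the middle number means, so it is kept as an opaque field. It also assumes that the token containing EMPTY_WORD marks an alignment to no word and should be skipped. The names `parse_entry` and `rank_translations` are hypothetical.

```python
# Rows copied from the slide's entry for "absurdní".
SLIDE_LINES = [
    "3.636364e-01 4 absurd",
    "1.818182e-01 54 possible",
    "1.818182e-01 2 preposterous",
    "9.090909e-02 23044 <EMPTY_WORD>",
    "9.090909e-02 7039 and",
]


def parse_entry(line):
    """Split one row into (probability, opaque number, English word)."""
    prob, ident, word = line.split()
    return float(prob), int(ident), word


def rank_translations(lines):
    """Return (probability, word) pairs, most probable first, skipping
    the empty-word marker. Ties keep their original order."""
    entries = [parse_entry(line) for line in lines]
    kept = [(p, w) for p, _ident, w in entries if "EMPTY_WORD" not in w]
    return sorted(kept, key=lambda pw: pw[0], reverse=True)


ranked = rank_translations(SLIDE_LINES)
print(ranked[0])
```

Probability alone still leaves noise such as "and" in the list, which is where an additional filter on the candidates becomes useful.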

21
Selecting Translation Guesses
  • Minimal Edit Distance
  • (Levenshtein Distance)
  • 2.50
  • 0.71
  • 1.5
  • 1.0

Czech English
Semantic resources (WordNet): same synset (WordNet 1.6)
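
For reference, the standard unit-cost edit distance can be computed as below. The fractional values on the slide suggest a weighted or normalized variant, which this sketch does not try to reproduce; the candidate list is taken from the absurdní lexicon entry, and the selection step shown is an assumption about how the distance might be used.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) between a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


# Pick the English candidate closest in spelling to the Czech word.
candidates = ["absurd", "possible", "preposterous", "and"]
best = min(candidates, key=lambda w: levenshtein("absurdní", w))
print(best, levenshtein("absurdní", best))
```

String distance only helps for cognates and loanwords, so it complements the alignment probabilities and the WordNet check; it does not replace them.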
22
Appendix
  • Data Size
  • Set    Total size    Sample size
  • 010    1021          468
  • 020    4006          2111
  • 030    6115          1131
  • 040    5317          1548