Title: Parsing Resources from Bilingual Texts
1 Parsing Resources from Bilingual Texts
- JHU 1998 Summer Workshop on Parsing Free Word Order Languages
- August 19, 1998
- Douglas Jones, Cynthia Kuo
2 Czech Reader's Digest Corpus
- Goal: to use an aligned bilingual Czech-English corpus to build a monolingual Czech grammar.
- This summer: survey and assessment
- Further work: grammar building
- Text: Czech Reader's Digest Corpus
- About 1M words for each language
- Substantially pre-processed in Prague
3 Survey and Assessment
- Process both Czech and English texts with current parsers
- Large set: 23K sentences
- Look for (near-)isomorphic parses
- Small set: 50 sentences
- Inventory syntactically motivated transformations
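One possible way to look for isomorphic parses is to compare an order-independent encoding of each dependency tree's shape. The sketch below assumes a parse is given as a list of head indices with -1 marking the root; that encoding, the function names, and the example head arrays are all illustrative assumptions, not the workshop's tooling. Tags, words, and word order are ignored.

```python
def canonical_shape(heads):
    """Return an order-independent encoding of a dependency tree's shape.

    heads[i] is the index of token i's head, or -1 for the root
    (an assumed encoding, chosen for illustration).
    """
    children = {i: [] for i in range(len(heads))}
    root = None
    for dependent, head in enumerate(heads):
        if head == -1:
            root = dependent
        else:
            children[head].append(dependent)

    def shape(node):
        # Sorting the child encodings makes sibling order irrelevant.
        return tuple(sorted(shape(child) for child in children[node]))

    return shape(root)


def is_isomorphic(heads_a, heads_b):
    """True when both trees have the same shape, ignoring word order."""
    return canonical_shape(heads_a) == canonical_shape(heads_b)


# Hypothetical head arrays for three-token sentences.
english = [1, -1, 1]   # tokens 0 and 2 depend on token 1
czech = [-1, 0, 0]     # tokens 1 and 2 depend on token 0
chain = [-1, 0, 1]     # token 1 depends on 0, token 2 depends on 1

print(is_isomorphic(english, czech))  # same shape, different word order
print(is_isomorphic(english, chain))  # different shape
```

A "near"-isomorphic test would need a tolerance on top of this, for example allowing one node to be added or removed; that choice is left open here.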
4 Sample Data
- that 's strange , he thought .
- feeling the glass , he discovered he could dampen the vibration but could n't make it stop .
- i'm making it better , '' les responded .
- to je zvláštní , pomyslel si .
- sáhl na sklo a zjistil , že chvění může sice zmírnit , ale že ho nezastaví úplně .
- " vylepšuju ho , " odpověděl les .
5 Closely Aligned Text
- Lengths of aligned sentences
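How closely the lengths of aligned sentences track each other can be checked with a few lines of code. The sketch below uses only the three sentence pairs from the Sample Data slide, counts whitespace-separated tokens, and computes a Pearson correlation by hand; the function names are illustrative, and three pairs are far too few to say anything about the full corpus.

```python
import math

# Tokenized sample pairs from the Sample Data slide (English, Czech).
PAIRS = [
    ("that 's strange , he thought .",
     "to je zvláštní , pomyslel si ."),
    ("feeling the glass , he discovered he could dampen the vibration "
     "but could n't make it stop .",
     "sáhl na sklo a zjistil , že chvění může sice zmírnit , "
     "ale že ho nezastaví úplně ."),
    ("i'm making it better , '' les responded .",
     "\" vylepšuju ho , \" odpověděl les ."),
]


def token_lengths(pairs):
    """Return (English length, Czech length) in tokens for each pair."""
    return [(len(en.split()), len(cz.split())) for en, cz in pairs]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


lengths = token_lengths(PAIRS)
r = pearson([en for en, _ in lengths], [cz for _, cz in lengths])
print(lengths)  # [(7, 7), (18, 18), (9, 8)]
print(round(r, 3))
```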
6 Parsing Both Sides of Corpus
- Michael Collins Parser
- English (88% accuracy)
- trained on Penn Treebank
- converted output to dependency trees
- Czech (78% accuracy)
- trained on constituents derived from Prague Dependency Treebank
- yielded constituent and dependency trees
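Converting constituent output to dependency trees is commonly done by choosing a head child for every constituent and attaching the heads of the other children to it. The sketch below illustrates that general idea only: the tree encoding, the tiny head-rule table, and the example parse are assumptions made for illustration and are not the rules used in the workshop.

```python
def to_dependencies(tree, head_rules):
    """Convert a constituency tree into a word list and a head map.

    A node is (label, [children]) or, for a preterminal, (tag, word).
    head_rules maps a constituent label to child labels in order of
    preference; if none is present, the leftmost child becomes the head.
    """
    words, heads = [], {}

    def visit(node):
        label, body = node
        if isinstance(body, str):          # preterminal: (tag, word)
            words.append(body)
            return len(words) - 1
        child_heads = [visit(child) for child in body]
        child_labels = [child[0] for child in body]
        head_pos = 0
        for preferred in head_rules.get(label, []):
            if preferred in child_labels:
                head_pos = child_labels.index(preferred)
                break
        head = child_heads[head_pos]
        for pos, child_head in enumerate(child_heads):
            if pos != head_pos:
                heads[child_head] = head   # attach non-head children
        return head

    heads[visit(tree)] = -1                # the root has no head
    return words, heads


# Hypothetical head rules and parse, for illustration only.
HEAD_RULES = {"S": ["VP"], "VP": ["VBD"], "NP": ["NN", "PRP"]}
TREE = ("S", [
    ("NP", [("PRP", "he")]),
    ("VP", [("VBD", "discovered"),
            ("NP", [("DT", "the"), ("NN", "vibration")])]),
])

words, heads = to_dependencies(TREE, HEAD_RULES)
print(words)  # ['he', 'discovered', 'the', 'vibration']
print(heads)  # each token index mapped to its head index, -1 for the root
```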
7 Exploration of Data Space
8 Data Issues
- Tagging
- Tokenizing
- Alignment
- Different treebank encodings
- Dependency versus constituency
9 Some tagset affinities
- Part of speech   Czech tag   English tag
- adjective        A-          J-
- adverb           D-          R-
- conjunction      J-          CC
- determiner       --, (P-)    D-
- noun             N-          N-
- preposition      R-          IN
- pronoun          P-          PRP
- verb             V-          V-
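The affinities in the table can be written down as a lookup from the first character of a Czech tag to compatible English tag prefixes. This is a minimal sketch: the dictionary is read off the table, while the function name and the example tags in the usage lines are assumptions for illustration.

```python
# Czech tag prefix -> compatible English tag prefixes, read off the table.
CZECH_TO_ENGLISH = {
    "A": ("J",),        # adjective
    "D": ("R",),        # adverb
    "J": ("CC",),       # conjunction
    "N": ("N",),        # noun
    "R": ("IN",),       # preposition
    "P": ("PRP", "D"),  # pronoun; also the nearest match for determiners
    "V": ("V",),        # verb
}


def tags_compatible(czech_tag, english_tag):
    """True if the two tags belong to the same part of speech per the table."""
    prefixes = CZECH_TO_ENGLISH.get(czech_tag[:1], ())
    return any(english_tag.startswith(prefix) for prefix in prefixes)


# Example tags below are illustrative.
print(tags_compatible("AA", "JJ"))  # adjective vs. adjective
print(tags_compatible("RR", "RB"))  # Czech preposition vs. English adverb
```

The same letter means different things on each side (R- is a preposition in the Czech tagset but an adverb in the English one, J- a conjunction versus an adjective), so comparing the raw tag strings directly would give misleading matches.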
10 Transformable
11 Not (easily) transformable
- English Czech
- punctuation, paraphrasing, ...
12 Isomorphic Structures
13 Partially Matching Trees
14 Constituency Mismatches
- The English parses had richer constituent structure than the Czech parses.
- The Collins parser was trained on constituents derived automatically from the Czech dependency trees.
15 Further Research
- Grammar construction
- Fully process English text (get parse structures)
- Use lexicon (23K correspondences with probabilities, also string edit distance)
- Transformations for English trees to Czech trees (with lexical anchors)
- Applications for Machine Translation
16 Dependency Transformations
- Start by finding isomorphic matches
- Allow a transformation based on one mismatched position
- Collect the new set of proposed matches as possible transformations
- Use greedy choice of transformations
- (We assume that the Czech-English correspondence is at best 70% reliable: 88% English × 78% Czech ≈ 69%.)
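The steps on this slide can be sketched as a small greedy loop. To keep the example short it works on flat tag sequences of equal length rather than on real dependency trees, and it treats a transformation as a simple tag rewrite; both are simplifying assumptions, as are the function names and the toy tag sequences.

```python
from collections import Counter


def single_mismatch(english_tags, czech_tags):
    """Return the (english, czech) tag pair if exactly one position differs."""
    if len(english_tags) != len(czech_tags):
        return None
    diffs = [(e, c) for e, c in zip(english_tags, czech_tags) if e != c]
    return diffs[0] if len(diffs) == 1 else None


def learn_transformations(pairs, max_rules=10):
    """Greedily pick the rewrite that repairs the most one-mismatch pairs."""
    pairs = [(list(en), list(cz)) for en, cz in pairs]
    rules = []
    for _ in range(max_rules):
        proposals = Counter()
        for en, cz in pairs:
            proposal = single_mismatch(en, cz)
            if proposal is not None:
                proposals[proposal] += 1
        if not proposals:
            break
        (source, target), _count = proposals.most_common(1)[0]
        rules.append((source, target))
        # Apply the chosen rewrite to every English sequence, then repeat.
        pairs = [([target if t == source else t for t in en], cz)
                 for en, cz in pairs]
    matched = sum(en == cz for en, cz in pairs)
    return rules, matched


# Toy, hypothetical tag sequences (English, Czech).
PAIRS = [
    (["PRON", "VERB", "NOUN"], ["PRON", "VERB", "NOUN"]),   # already matches
    (["PRON", "MODAL", "VERB"], ["PRON", "VERB", "VERB"]),
    (["NOUN", "MODAL", "VERB"], ["NOUN", "VERB", "VERB"]),
    (["DET", "NOUN", "VERB"], ["ADJ", "NOUN", "VERB"]),
]

rules, matched = learn_transformations(PAIRS)
print(rules)    # [('MODAL', 'VERB'), ('DET', 'ADJ')]
print(matched)  # 4
```

A rewrite applied everywhere can also break pairs that already matched, so a fuller version would score each candidate by its net gain in matches before accepting it.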
17 Dependency Transformations
- [Figure: English and Czech tree pairs; node labels: Modal, Verb, Verb]
18 Transformation Example
- [Figure: panels (a) and (b)]
- (a) Initial match (2)
- (b) Match after transformation (5)
19 Brill and Florian's Transformation Learning Tool
- English Czech
- feeling the glass feeling on glass
- feeling on glass
- Hypothesis: learning a Czech parse, given an English parse, is easier than learning a Czech parse from scratch
- Take as much of the English parse as matches the tags.
- Apply transformation-driven learning (with hand-annotated examples).
20 Vocabulary Correspondence
- About 12K words with likely translations (from word alignment experiment)
- total 1.000000e+00
- absurdní 9
- 3.636364e-01 4 absurd
- 1.818182e-01 54 possible
- 1.818182e-01 2 preposterous
- 9.090909e-02 23044 <EMPTY_WORD>
- 9.090909e-02 7039 and
- Jan Hajic's students
- Cmejrek (1998) Master's Thesis on automatic extraction
- Curin (1988) Undergraduate Thesis on automatic extraction
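A lexicon entry like the one shown for the Czech adjective above can be read into a ranked candidate list with a short parser. The sketch assumes each candidate line holds a probability, a numeric word ID, and the English word, and that the first line holds the Czech word followed by a number; that reading of the columns is an assumption, as are the function names.

```python
ENTRY = """\
absurdní 9
3.636364e-01 4 absurd
1.818182e-01 54 possible
1.818182e-01 2 preposterous
9.090909e-02 23044 <EMPTY_WORD>
9.090909e-02 7039 and
"""

EMPTY = "<EMPTY_WORD>"


def parse_entry(text):
    """Return (czech_word, [(probability, english_word), ...])."""
    lines = text.strip().splitlines()
    czech_word = lines[0].split()[0]
    candidates = []
    for line in lines[1:]:
        probability, _word_id, english_word = line.split()
        candidates.append((float(probability), english_word))
    return czech_word, candidates


def best_translation(candidates):
    """Most probable candidate, skipping the empty-word alignment."""
    real = [c for c in candidates if c[1] != EMPTY]
    return max(real, key=lambda c: c[0])[1] if real else None


czech_word, candidates = parse_entry(ENTRY)
print(czech_word, "->", best_translation(candidates))  # absurdní -> absurd
```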
21 Selecting Translation Guesses
- Minimal Edit Distance
- (Levenshtein Distance)
- 2.50
- 0.71
- 1.5
- 1.0
Czech English
- Semantic Resources (WordNet): same synset (WordNet 1.6)
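Minimal edit distance can be computed with the standard dynamic-programming recurrence. The sketch below is a plain Levenshtein implementation with unit costs; the scores on the slide are not whole numbers, so the workshop presumably used some weighting or normalization that is not reproduced here. The candidate list is taken from the vocabulary example, and the helper name is illustrative.

```python
def levenshtein(a, b):
    """Unit-cost edit distance (insert, delete, substitute) between strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute or match
            ))
        previous = current
    return previous[-1]


def closest(word, candidates):
    """Candidate with the smallest edit distance to word."""
    return min(candidates, key=lambda candidate: levenshtein(word, candidate))


CANDIDATES = ["absurd", "possible", "preposterous", "and"]
print(levenshtein("absurdní", "absurd"))  # 2
print(closest("absurdní", CANDIDATES))    # absurd
```

Edit distance only helps for cognates such as this pair; for unrelated word forms it has to be combined with the alignment probabilities or a semantic resource.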
22 Appendix
- Data Size
- Set   Total Size   Sample Size
- 010   1021         468
- 020   4006         2111
- 030   6115         1131
- 040   5317         1548