Parsing Resources from Bilingual Texts - PowerPoint PPT Presentation

Title: Parsing Resources from Bilingual Texts

1
Parsing Resources from Bilingual Texts

JHU 1998 Summer Workshop on Parsing Free Word
Order Languages
August 19, 1998
Douglas Jones, Cynthia Kuo

2
Czech Reader's Digest Corpus

Goal: to use an aligned bilingual Czech-English
corpus to build a monolingual Czech grammar.
This summer: survey and assessment
Further work: grammar building
Text: Czech Reader's Digest Corpus
About 1M words for each language
Substantially pre-processed in Prague

3
Survey and Assessment

Process both Czech and English texts with current
parsers
Large set: 23K sentences
look for (near-)isomorphic parses
Small set: 50 sentences
inventory syntactically motivated transformations

4
Sample Data
that 's strange , he thought .
feeling the glass , he discovered he could dampen the vibration but
could n't make it stop .
`` i'm making it better , '' les responded .

to je zvláštní , pomyslel si .
sáhl na sklo a zjistil , že chvění může sice
zmírnit , ale že ho nezastaví úplně .
" vylepšuju ho , " odpověděl les .

5
Closely Aligned Text

Lengths of aligned sentences

6
Parsing Both Sides of Corpus

Michael Collins Parser
English (88% accuracy)
trained on Penn Treebank
converted output to dependency trees
Czech (78% accuracy)
trained on constituents derived from Prague
Dependency Treebank
yielded constituent and dependency trees
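
The slides do not say how constituent output was converted to dependency trees. One common approach, assumed here, marks a head child in every constituent and attaches the head words of the other children to it. A minimal Python sketch under that assumption, using a hypothetical tree encoding (a leaf is a word index; an internal node is a pair of head-child position and list of children):

```python
ROOT = -1

def _head_word(node, deps):
    """Return the head word index of `node`, recording dependencies in `deps`."""
    if isinstance(node, int):          # leaf: a word index
        return node
    head_pos, children = node          # internal node (hypothetical encoding)
    heads = [_head_word(child, deps) for child in children]
    for i, h in enumerate(heads):
        if i != head_pos:
            deps[h] = heads[head_pos]  # non-head children depend on the head
    return heads[head_pos]

def constituents_to_dependencies(tree):
    """Map each word index to the index of its head (ROOT for the top word)."""
    deps = {}
    root = _head_word(tree, deps)
    deps[root] = ROOT
    return deps

# Toy three-word sentence: S -> word0 VP, VP -> word1 word2,
# with VP heading S and word1 heading VP.
toy_tree = (1, [0, (0, [1, 2])])
toy_deps = constituents_to_dependencies(toy_tree)
```

Here `toy_deps` attaches words 0 and 2 to word 1, which becomes the root. Which child counts as the head in a real treebank depends on head rules that the slides do not list.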

7
Exploration of Data Space

8
Data Issues

Tagging
Tokenizing
Alignment
Different treebank encodings
Dependency versus constituency

9
Some tagset affinities

Part of speech   Czech tag    English tag
adjective        A-           J-
adverb           D-           R-
conjunction      J-           CC
determiner       --, (P-)     D-
noun             N-           N-
preposition      R-           IN
pronoun          P-           PRP
verb             V-           V-
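
The affinities in this table can be written down as a lookup from Czech tag prefixes to English tag prefixes. The sketch below is illustrative only: the function name, the prefix-matching rule, and the sample tags in the usage lines are assumptions, not part of the workshop tooling.

```python
# Czech tag prefix -> compatible English tag prefixes, transcribed from the table.
CZECH_TO_ENGLISH = {
    "A": ("J",),         # adjective
    "D": ("R",),         # adverb
    "J": ("CC",),        # conjunction
    "N": ("N",),         # noun
    "R": ("IN",),        # preposition
    "P": ("PRP", "D"),   # pronoun; also the parenthesized determiner affinity
    "V": ("V",),         # verb
}

def tags_compatible(czech_tag: str, english_tag: str) -> bool:
    """True if the first letter of the Czech tag has an affinity with the English tag."""
    prefixes = CZECH_TO_ENGLISH.get(czech_tag[:1], ())
    return any(english_tag.startswith(p) for p in prefixes)

# Assumed sample tags, for illustration only.
noun_ok = tags_compatible("NN", "NNS")     # noun with noun
prep_ok = tags_compatible("RR", "IN")      # preposition with preposition
clash = tags_compatible("VB", "NN")        # verb with noun
```

Matching on prefixes is deliberately coarse: an English prefix such as R- also covers tags other than adverbs, so a real comparison would need a finer mapping.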

10
Transformable

English Czech

11
Not (easily) transformable

English Czech
punctuation, paraphrasing, ...

12
Isomorphic Structures

English Czech

13
Partially Matching Trees

14
Constituency Mismatches

The English parses had richer constituent
structure than the Czech parses.
The Collins parser was trained on constituents
derived automatically from the Czech dependency
trees.

15
Further Research

Grammar construction
Fully process English text (get parse structures)
Use lexicon (23K correspondences with
probabilities, also string edit distance)
Transformations for English trees to Czech trees
(with lexical anchors)
Applications for Machine Translation
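
The lexicon item above mentions string edit distance without giving details. As an illustration only, the following computes the standard Levenshtein distance, which could help flag names and cognates that look alike in both languages. How the workshop actually used edit distance is not stated, and `similarity` is a hypothetical helper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))      # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                      # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Hypothetical normalized score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b), 1)
    return 1.0 - edit_distance(a, b) / longest

# The name "les" appears unchanged on both sides of the sample data.
same_name = similarity("les", "les")
```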

16
Dependency Transformations

Start by finding isomorphic matches
Allow a transformation based on one mismatched
position
Collect new set of proposed matches as possible
transformation
Use greedy choice of transformations
(We assume that the Czech-English correspondence
is at best about 70% reliable: 88% English x 78%
Czech.)
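
The steps above can be sketched in Python under several assumptions that the slides do not confirm: each tree is a list of head indices (-1 for the root), the word alignment is a dictionary from English to Czech positions, and a proposed transformation is identified by the English tags of the mismatched word and its head. All names and the toy data are hypothetical.

```python
from collections import Counter

ROOT = -1

def mismatches(en_heads, cz_heads, alignment):
    """English positions whose head attachment is not mirrored in the Czech tree."""
    bad = []
    for e, c in sorted(alignment.items()):
        eh, ch = en_heads[e], cz_heads[c]
        same = (eh == ROOT and ch == ROOT) or (eh != ROOT and alignment.get(eh) == ch)
        if not same:
            bad.append(e)
    return bad

def greedy_pick(sentence_pairs):
    """Among pairs that differ in exactly one position, return the most frequent proposal."""
    proposals = Counter()
    for en_heads, en_tags, cz_heads, alignment in sentence_pairs:
        bad = mismatches(en_heads, cz_heads, alignment)
        if len(bad) == 1:                      # one mismatched position
            e = bad[0]
            eh = en_heads[e]
            head_tag = "ROOT" if eh == ROOT else en_tags[eh]
            proposals[(en_tags[e], head_tag)] += 1
    return proposals.most_common(1)[0] if proposals else None

# Toy data: two pairs off by one position, one fully isomorphic pair.
off_by_one = ([1, ROOT, 1], ["PRP", "MD", "VB"], [1, ROOT, 0], {0: 0, 1: 1, 2: 2})
isomorphic = ([1, ROOT], ["PRP", "VB"], [1, ROOT], {0: 0, 1: 1})
best = greedy_pick([off_by_one, isomorphic, off_by_one])
```

A pair with an empty mismatch list is an isomorphic match. Repeating the pick after applying the chosen transformation would give the greedy loop the slide describes; that loop is omitted here because the slides do not specify how a transformation rewrites the tree.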

17
Dependency Transformations

[Figure: two English/Czech tree pairs; node labels: Modal, Verb, Verb]

18
Transformation Example

(a) Initial match (2)
(b) Match after transformation (5)

19
Brill and Florian's Transformation Learning Tool

English              Czech
feeling the glass    feeling on glass

Hypothesis: learning a Czech parse, given the English
parse, is easier than learning a Czech parse from scratch.

Take as much of the English parse as matches the
tags.
Apply transformation-driven learning (with
hand-annotated examples).

20
Vocabulary Correspondence