Title: Parallel Reverse Treebanks for the Discovery of Morpho-Syntactic Markings
1 Parallel Reverse Treebanks for the Discovery of Morpho-Syntactic Markings
- Lori Levin
- Robert Frederking
- Alison Alvarez
- Language Technologies Institute
- School of Computer Science
- Carnegie Mellon University
- Jeff Good
- Department of Linguistics
- Max Planck Institute for Evolutionary Anthropology
2 Reverse Treebank (RTB)
- What?
- Create the syntactic structures first
- Then add sentences
- Why?
- To elicit data from speakers of less commonly taught languages
- Decide what meaning we want to elicit
- Represent the meaning in a feature structure
- Add an English or Spanish sentence (plus context notes) to express the meaning
- Ask the informant to translate it
3 Bengali Example
srcsent: The large bus to the post office broke down.
context:
tgtsent:
((actor ((modifier ((mod-role mod-descriptor)
                    (mod-role role-loc-general-to)))
         (np-identifiability identifiable)
         (np-specificity specific)
         (np-biological-gender bio-gender-n/a)
         (np-animacy anim-inanimate)
         (np-person person-third)
         (np-function fn-actor)
         (np-general-type common-noun-type)
         (np-number num-sg)
         (np-pronoun-exclusivity inclusivity-n/a)
         (np-pronoun-antecedent antecedent-n/a)
         (np-distance distance-neutral)))
 (c-general-type declarative-clause)
 (c-my-causer-intentionality intentionality-n/a)
 (c-comparison-type comparison-n/a)
 (c-relative-tense relative-n/a)
 (c-our-boundary boundary-n/a)
 (c-comparator-function comparator-n/a)
 (c-causee-control control-n/a)
 (c-our-situations situations-n/a)
 (c-comparand-type comparand-n/a)
 (c-causation-directness directness-n/a)
 (c-source source-neutral)
 (c-causee-volitionality volition-n/a)
 (c-assertiveness assertiveness-neutral)
 (c-solidarity solidarity-neutral)
 (c-polarity polarity-positive)
 (c-v-grammatical-aspect gram-aspect-neutral)
 (c-adjunct-clause-type adjunct-clause-type-n/a)
 (c-v-phase-aspect phase-aspect-neutral)
 (c-v-lexical-aspect activity-accomplishment)
 (c-secondary-type secondary-neutral)
 (c-event-modality event-modality-none)
 (c-function fn-main-clause)
 (c-minor-type minor-n/a)
 (c-copula-type copula-n/a)
 (c-v-absolute-tense past)
 (c-power-relationship power-peer)
 (c-our-shared-subject shared-subject-n/a)
 (c-question-gap gap-n/a))
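A feature structure in this parenthesized notation can be read into nested lists with a few lines of code. The following is a minimal sketch, not the project's own tooling: the function names (`tokenize`, `parse`, `flat_features`) are hypothetical, and the input is an abridged copy of the example, shortened here for illustration.

```python
def tokenize(text):
    """Split a parenthesized feature structure into '(', ')' and atom tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Recursively build nested lists from a token list (consumed in place)."""
    token = tokens.pop(0)
    if token != "(":
        return token                      # an atom such as 'np-number'
    items = []
    while tokens[0] != ")":
        items.append(parse(tokens))
    tokens.pop(0)                         # drop the closing ')'
    return items

def flat_features(node, out=None):
    """Collect every (feature value) pair of two atoms, ignoring nesting."""
    out = {} if out is None else out
    if isinstance(node, list):
        if len(node) == 2 and all(isinstance(x, str) for x in node):
            out.setdefault(node[0], []).append(node[1])
        else:
            for child in node:
                flat_features(child, out)
    return out

# Abridged copy of the example (shortened for illustration, same notation).
fs_text = """((actor ((modifier ((mod-role mod-descriptor)
                                 (mod-role role-loc-general-to)))
              (np-number num-sg)(np-animacy anim-inanimate)))
             (c-v-absolute-tense past)(c-polarity polarity-positive))"""

tree = parse(tokenize(fs_text))
features = flat_features(tree)
print(features["mod-role"])            # both modifier roles of the actor
print(features["c-v-absolute-tense"])  # clause-level tense
```

Flattening loses the information about which noun phrase a feature belongs to, so a real consumer would walk `tree` instead; the flat view is only convenient for quick inspection.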
4 Outline
- Background
- The AVENUE Machine Translation System
- Contents of the RTB
- An inventory of grammatical meanings
- Languages that have been elicited
- Tools for RTB creation
- Future work
- Evaluation
- Navigation
5 AVENUE Machine Translation System
SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
 (X1::Y1)
 (X1::Y3)
 (X2::Y4)
 (X3::Y2)
 ((X1 AGR) = 3-SING)
 ((X1 DEF) = DEF)
 ((X3 AGR) = 3-SING)
 ((X3 COUNT) = +)
 ((Y1 DEF) = DEF)
 ((Y3 DEF) = DEF)
 ((Y2 AGR) = 3-SING)
 ((Y2 GENDER) = (Y4 GENDER))
)
- Type information
- Synchronous Context Free Rules
- Alignments
- x-side constraints
- y-side constraints
- xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
Jaime Carbonell (PI), Alon Lavie (Co-PI), Lori Levin (Co-PI)
Rule learning: Katharina Probst
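The alignment part of such a rule can be illustrated with a small sketch. This is not the AVENUE transfer engine: the `TransferRule` class, the `reorder` function, and the three-word lexicon are hypothetical and cover only the reordering and duplication expressed by the alignments, leaving out all feature constraints.

```python
from dataclasses import dataclass

@dataclass
class TransferRule:
    src_type: str       # type information, source side (e.g. "NP")
    tgt_type: str       # type information, target side
    src_rhs: list       # source constituent sequence (x-side)
    tgt_rhs: list       # target constituent sequence (y-side)
    alignments: list    # (x_index, y_index) pairs, 1-based as in X1::Y1

def reorder(rule, src_words):
    """Place each source word into every target slot it is aligned to."""
    if len(src_words) != len(rule.src_rhs):
        raise ValueError("input does not match the x-side of the rule")
    slots = [None] * len(rule.tgt_rhs)
    for x, y in rule.alignments:
        slots[y - 1] = src_words[x - 1]
    return slots

np_rule = TransferRule(
    src_type="NP", tgt_type="NP",
    src_rhs=["DET", "ADJ", "N"],
    tgt_rhs=["DET", "N", "DET", "ADJ"],
    alignments=[(1, 1), (1, 3), (2, 4), (3, 2)],
)

# Toy lexicon for this one example only; not part of any AVENUE resource.
lexicon = {"the": "ha-", "old": "zaqen", "man": "ish"}

ordered = reorder(np_rule, ["the", "old", "man"])
glosses = [lexicon[w] for w in ordered]
print(ordered)   # source words in target order
print(glosses)   # word-by-word target glosses
```

Because X1 is aligned to both Y1 and Y3, the determiner is copied into two target slots, which is how the rule yields the doubled definite marker in "ha-ish ha-zaqen".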
6 AVENUE
- Rules can be written by hand or learned automatically.
- Hybrid
- Rule-based transfer
- Statistical decoder
- Multi-engine combinations with SMT and EBMT
7 AVENUE systems (small and experimental, but tested on unseen data)
- Hebrew-to-English
- Alon Lavie, Shuly Wintner, Katharina Probst
- Hand-written and automatically learned
- Automatic rules trained on 120 sentences perform slightly better than about 20 hand-written rules.
- Hindi-to-English
- Lavie, Peterson, Probst, Levin, Font, Cohen, Monson
- Automatically learned
- Performs better than SMT when training data is limited to 50K words
8 AVENUE systems (small and experimental, but tested on unseen data)
- English-to-Spanish
- Ariadna Font Llitjos
- Hand-written, automatically corrected
- Mapudungun-to-Spanish
- Roberto Aranovich and Christian Monson
- Hand-written
- Dutch-to-English
- Simon Zwarts
- Hand-written
9 Elicitation
- Get data from someone who is
- Bilingual
- Literate
- Not experienced with linguistics
10 English-Hindi Example
Elicitation Tool: Erik Peterson
11 English-Chinese Example
12 English-Arabic Example
13 Elicitation
- srcsent: Tú caíste
- tgtsent: eymi ütrünagimi
- aligned: ((1,1),(2,2))
- context: tú = Juan [masculino, 2a persona del singular]
- comment: You (John) fell
- srcsent: Tú estás cayendo
- tgtsent: eymi petu ütrünagimi
- aligned: ((1,1),(2 3,2 3))
- context: tú = Juan [masculino, 2a persona del singular]
- comment: You (John) are falling
- srcsent: Tú caíste
- tgtsent: eymi ütrunagimi
- aligned: ((1,1),(2,2))
- context: tú = María [femenino, 2a persona del singular]
- comment: You (Mary) fell
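The aligned field pairs groups of source word positions with groups of target word positions. A minimal sketch of reading it follows; the function names are hypothetical, and the code assumes 1-based positions with space-separated indices inside each group, as in the examples above.

```python
import re

def parse_alignment(text):
    """Turn '((1,1),(2 3,2 3))' into [([1], [1]), ([2, 3], [2, 3])]."""
    pairs = []
    for src, tgt in re.findall(r"\(([\d ]+),([\d ]+)\)", text):
        pairs.append(([int(i) for i in src.split()],
                      [int(i) for i in tgt.split()]))
    return pairs

def aligned_words(srcsent, tgtsent, alignment):
    """Map each aligned group of source words to its group of target words."""
    src, tgt = srcsent.split(), tgtsent.split()
    return [(" ".join(src[i - 1] for i in s), " ".join(tgt[j - 1] for j in t))
            for s, t in parse_alignment(alignment)]

pairs = aligned_words("Tú estás cayendo", "eymi petu ütrünagimi",
                      "((1,1),(2 3,2 3))")
print(pairs)
```

For the second record this pairs "Tú" with "eymi" and the two-word progressive "estás cayendo" with "petu ütrünagimi".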
15 Size of RTB
- Around 3200 sentences
- 20K words
16 Languages
- The set of feature structures with English sentences has been delivered to the Linguistic Data Consortium as part of the Reflex program.
- Translated (by LDC) into
- Thai
- Bengali
- Plans to translate into
- Seven strategic languages per year for five years.
- As one small part of a language pack (BLARK) for each language.
17 Languages
- Feature structures are being reverse annotated in Spanish at New Mexico State University (Helmreich and Cowie)
- Plans to translate into Guarani
- Reverse annotation into Portuguese in Brazil (Marcello Modesto)
- Plans to translate into Karitiana
- 200 speakers
- Plans to translate into Inupiaq (Kaplan and MacLean)
18 Previous Elicitation Work
- Pilot corpus
- Around 900 sentences
- No feature structures
- Mapudungun
- Two partial translations
- Quechua
- Three translations
- Aymara
- Seven translations
- Hebrew
- Hindi
- Several translations
- Dutch
19 Sample clause level
- Mary is writing a book for John.
- Who let him eat the sandwich?
- Who had the machine crush the car?
- They did not make the policeman run.
- Mary had not blinked.
- The policewoman was willing to chase the boy.
- Our brothers did not destroy files.
- He said that there is not a manual.
- The teacher who wrote a textbook left.
- The policeman chased the man who was a thief.
- Mary began to work.
- Tense, aspect, transitivity
- Questions, causation and permission
- Interaction of lexical and grammatical aspect
- Volitionality
- Embedded clauses and sequence of tense
- Relative clauses
- Phase aspect
20 Sample noun phrase level
- The man quit in November.
- The man works in the afternoon.
- The balloon floated over the library.
- The man walked over the platform.
- The man came out from among the group of boys.
- The long weekly meeting ended.
- The large bus to the post office broke down.
- The second man laughed.
- All five boys laughed.
- Temporal and locative meanings
- Quantifiers
- Numbers
- Combinations of different types of modifiers
- My book
- Possession, definiteness
- A book of mine
- Possession, indefiniteness
21 Example
srcsent: The large bus to the post office broke down.
((actor ((modifier ((mod-role mod-descriptor)
                    (mod-role role-loc-general-to)))
         (np-identifiability identifiable)
         (np-specificity specific)
         (np-biological-gender bio-gender-n/a)
         (np-animacy anim-inanimate)
         (np-person person-third)
         (np-function fn-actor)
         (np-general-type common-noun-type)
         (np-number num-sg)
         (np-pronoun-exclusivity inclusivity-n/a)
         (np-pronoun-antecedent antecedent-n/a)
         (np-distance distance-neutral)))
 (c-general-type declarative-clause)
 (c-my-causer-intentionality intentionality-n/a)
 (c-comparison-type comparison-n/a)
 (c-relative-tense relative-n/a)
 (c-our-boundary boundary-n/a)
 (c-comparator-function comparator-n/a)
 (c-causee-control control-n/a)
 (c-our-situations situations-n/a)
 (c-comparand-type comparand-n/a)
 (c-causation-directness directness-n/a)
 (c-source source-neutral)
 (c-causee-volitionality volition-n/a)
 (c-assertiveness assertiveness-neutral)
 (c-solidarity solidarity-neutral)
 (c-polarity polarity-positive)
 (c-v-grammatical-aspect gram-aspect-neutral)
 (c-adjunct-clause-type adjunct-clause-type-n/a)
 (c-v-phase-aspect phase-aspect-neutral)
 (c-v-lexical-aspect activity-accomplishment)
 (c-secondary-type secondary-neutral)
 (c-event-modality event-modality-none)
 (c-function fn-main-clause)
 (c-minor-type minor-n/a)
 (c-copula-type copula-n/a)
 (c-v-absolute-tense past)
 (c-power-relationship power-peer)
 (c-our-shared-subject shared-subject-n/a)
 (c-question-gap gap-n/a))
22 Grammatical meanings vs syntactic categories
- Features and values are based on a collection of grammatical meanings
- Many of which are similar to the grammatemes of the Prague Treebanks
23 Grammatical Meanings
- YES
- Semantic Roles
- Identifiability
- Specificity
- Time
- Before, after, or during time of speech
- Modality
- NO
- Case
- Voice
- Determiners
- Auxiliary verbs
24 Grammatical Meanings
- YES
- How is identifiability expressed?
- Determiner
- Word order
- Optional case marker
- Optional verb agreement
- How is specificity expressed?
- How are generics expressed?
- How are predicate nominals marked?
- NO
- How are English determiners translated?
- The boy cried.
- The lion is a fierce beast.
- I ate a sandwich.
- He is a soldier.
- Il est soldat.
25 Argument Roles
- Actor
- Roughly, deep subject
- Undergoer
- Roughly, deep object
- Predicate and predicatee
- The woman is the manager.
- Recipient
- I gave a book to the students.
- Beneficiary
- I made a phone call for Sam.
26 Why not subject and object?
- Languages use their voice systems for different purposes.
- Mapudungun obligatorily uses an inverse-marked verb when third person acts on first or second person.
- Verb agrees with undergoer
- Undergoer exhibits other subjecthood properties
- Actor may be object.
- Yes: How are actor and undergoer encoded in combination with other semantic features like adversity (Japanese) and person (Mapudungun)?
- No: How is English voice translated into another language?
27 Argument Roles
- Accompaniment
- With someone
- With pleasure
- Material
- (out) of wood
- About 20 more roles
- From the Lingua checklist (Comrie & Smith, 1977)
- Many also found in tectogrammatical representations
- Around 80 locative relations
- From the Lingua checklist
- Many temporal relations
28 Noun Phrase Features
- Person
- Number
- Biological gender
- Animacy
- Distance (for deictics)
- Identifiability
- Specificity
- Possession
- Other semantic roles
- Accompaniment, material, location, time, etc.
- Type
- Proper, common, pronoun
- Cardinals
- Ordinals
- Quantifiers
- Given and new information
- Not used yet because of limited context in the
elicitation tool.
29 Clause level features
- Tense
- Aspect
- Lexical, grammatical, phase
- Type
- Declarative, open-q, yes-no-q
- Function
- Main, argument, adjunct, relative
- Source
- Hearsay, first-hand, sensory, assumed
- Assertedness
- Asserted, presupposed, wanted
- Modality
- Permission, obligation
- Internal, external
30 Other clause types (Constructions)
- Causative
- Make/let/have someone do something
- Predication
- May be expressed with or without an overt copula.
- Existential
- There is a problem.
- Impersonal
- One doesn't smoke in restaurants in the US.
- Lament
- If only I had read the paper.
- Conditional
- Comparative
- Etc.
32 Tools for RTB Creation
- Change the inventory of grammatical meanings
- Make new RTBs for other purposes
33 The Process
- Feature Specification: list of semantic features and values
- Feature Maps (Tense & Aspect, Clause-Level, Noun-Phrase, Modality): which combinations of features and values are of interest
- Feature Structure Sets
- Reverse Annotated Feature Structure Sets: add English sentences
- The Corpus
- Sampling
- Smaller Corpus
34 Feature Specification
- XML Schema
- XSLT Script
- Human readable form
- Feature: Causer intentionality
- Values: intentional, unintentional
- Feature: Causee control
- Values: in control, not in control
- Feature: Causee volitionality
- Values: willing, unwilling
- Feature: Causation type
- Values: direct, indirect
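The step from an XML specification to a human readable listing can be sketched as follows. The slides name an XSLT script for this step; the sketch substitutes Python's standard `xml.etree.ElementTree` for illustration. The XML layout (`feature-specification`, `feature`, `name`, `value`) is an assumption, since the actual schema is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the element and attribute names are assumptions.
SPEC = """
<feature-specification>
  <feature name="Causer intentionality">
    <value>intentional</value>
    <value>unintentional</value>
  </feature>
  <feature name="Causee control">
    <value>in control</value>
    <value>not in control</value>
  </feature>
</feature-specification>
"""

def human_readable(xml_text):
    """Render each feature and its values as two plain-text lines."""
    root = ET.fromstring(xml_text.strip())
    lines = []
    for feature in root.findall("feature"):
        values = [v.text.strip() for v in feature.findall("value")]
        lines.append(f"Feature: {feature.get('name')}")
        lines.append(f"Values: {', '.join(values)}")
    return lines

readable = human_readable(SPEC)
print("\n".join(readable))
```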
35 Feature Combination
- Person and number interact with tense in many fusional languages.
- In English, tense interacts with questions
- Will you go?
36 Feature Combination Template
((predicatee
  ((np-general-type pronoun-type common-noun-type)
   (np-person person-first person-second person-third)
   (np-number num-sg num-pl)
   (np-biological-gender bio-gender-male bio-gender-female)))
 (predicate ((np-general-type common-noun-type)
             (np-person person-third)))
 (c-copula-type role)
 (predicate ((adj-general-type quality-type)
             (c-copula-type attributive)))
 (predicate ((np-general-type common-noun-type)
             (np-person person-third)
             (c-copula-type identity)))
 (c-secondary-type secondary-copula)
 (c-polarity all)
 (c-general-type declarative)
 (c-speech-act sp-act-state)
 (c-v-grammatical-aspect gram-aspect-neutral)
 (c-v-lexical-aspect state)
 (c-v-absolute-tense past present future)
 (c-v-phase-aspect durative))
Summarizes 288 feature structures, which are
automatically generated.
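Expanding a template amounts to taking the cross product of the listed values. The slides do not say how the count of 288 arises. The sketch below shows one reading that reproduces it, under two assumptions that are not stated in the source: common nouns occur only in the third person, and the polarity value "all" stands for two values, positive and negative.

```python
from itertools import product

# Subject (predicatee) variants listed in the template.
predicatee = [
    {"np-general-type": t, "np-person": p,
     "np-number": n, "np-biological-gender": g}
    for t, p, n, g in product(
        ["pronoun-type", "common-noun-type"],
        ["person-first", "person-second", "person-third"],
        ["num-sg", "num-pl"],
        ["bio-gender-male", "bio-gender-female"],
    )
    # Assumed constraint: common nouns are always third person.
    if t == "pronoun-type" or p == "person-third"
]

copula_types = ["role", "attributive", "identity"]       # three predicate variants
polarities = ["polarity-positive", "polarity-negative"]  # assumed meaning of "all"
tenses = ["past", "present", "future"]

feature_structures = [
    {"predicatee": subj, "c-copula-type": cop,
     "c-polarity": pol, "c-v-absolute-tense": tense}
    for subj, cop, pol, tense in product(predicatee, copula_types,
                                         polarities, tenses)
]
print(len(predicatee), len(feature_structures))
```

Under these assumptions there are 16 subject variants and 16 x 3 x 2 x 3 = 288 feature structures; other constraint sets could give the same total.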
37 Annotation Tool
- Feature structure viewer
- Various views of the feature structure
- Omit features whose value is not-applicable
- Group related features together
- Aspect
- Causation
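The two viewer behaviours listed above, hiding not-applicable features and grouping related ones, can be sketched in a few lines. This is an illustration only: the `GROUPS` table and the substring matching are hypothetical, and the test for "not applicable" assumes such values end in `n/a`, as they do in the example feature structure.

```python
from collections import defaultdict

# A few clause-level features taken from the example feature structure.
features = {
    "c-v-absolute-tense": "past",
    "c-v-lexical-aspect": "activity-accomplishment",
    "c-v-phase-aspect": "phase-aspect-neutral",
    "c-causee-control": "control-n/a",
    "c-causation-directness": "directness-n/a",
    "c-polarity": "polarity-positive",
    "c-question-gap": "gap-n/a",
}

# Hypothetical grouping keys; the real tool's groups are not specified here.
GROUPS = {"aspect": "Aspect", "caus": "Causation"}

def view(feats):
    """Drop not-applicable features and group the rest by topic."""
    grouped = defaultdict(dict)
    for name, value in feats.items():
        if value.endswith("n/a"):
            continue                      # omit not-applicable features
        label = next((g for key, g in GROUPS.items() if key in name), "Other")
        grouped[label][name] = value
    return dict(grouped)

shown = view(features)
for label, members in shown.items():
    print(label, members)
```

In this sample every causation feature is not applicable, so the Causation group disappears from the view entirely.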
39 Evaluation
- Current funding has not covered evaluation of the RTB.
- Except for informal observations as it was translated into several languages.
- Does it elicit the meanings it was intended to elicit?
- Informal observation: usually
- Is it useful for machine translation?
40 Hard Problems
- Reverse annotating meanings that are not grammaticalized in English.
- Evidentiality
- He stole the bread.
- Context: Translate this as if you do not have first-hand knowledge. In English, we might say, "They say that he stole the bread" or "I hear that he stole the bread."
41 Hard Problems
- Reverse annotating things that can be said in several ways in English.
- Impersonals
- One doesn't smoke here.
- You don't smoke here.
- They don't smoke here.
- Credit cards aren't accepted.
- Problem in the Reflex corpus because space was limited.
42 Navigation
- Currently, feature combinations are specified by a human.
- Plan to work in active learning mode.
- Build seed RTB
- Translate some data
- Do some learning
- Identify most valuable pieces of information to get next
- Generate an RTB for those pieces of information
- Translate more
- Learn more
- Generate more, etc.
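The planned loop can be sketched schematically. Everything here is hypothetical, since the slides describe this as future work: `translate`, `learn`, and `value` stand in for the informant, the rule learner, and whatever measure of informativeness is eventually chosen, and the toy versions below only track which feature names have been covered.

```python
def active_elicitation(candidates, translate, learn, value,
                       batch_size=1, rounds=3):
    """Alternate between eliciting translations and choosing what to ask next."""
    corpus, model = [], None
    remaining = list(candidates)
    for _ in range(rounds):
        if not remaining:
            break
        # Identify the most valuable pieces of information to get next.
        remaining.sort(key=lambda fs: value(model, fs), reverse=True)
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        # Translate the new batch, then learn from the enlarged corpus.
        corpus.extend((fs, translate(fs)) for fs in batch)
        model = learn(corpus)
    return corpus, model

# Toy stand-ins (assumptions for illustration only).
def translate(fs):
    return f"<translation of {fs[0]}={fs[1]}>"

def learn(corpus):
    return {fs[0] for fs, _ in corpus}       # feature names seen so far

def value(model, fs):
    return 0 if model and fs[0] in model else 1   # prefer unseen features

candidates = [("tense", "past"), ("tense", "future"),
              ("number", "pl"), ("person", "second")]
corpus, model = active_elicitation(candidates, translate, learn, value)
print([fs for fs, _ in corpus])
```

With these toy definitions the loop asks about tense once, then moves on to number and person before returning to a second tense value.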