Title: Data Elicitation for AVENUE
1Data Elicitation for AVENUE
- By Alison Alvarez
- Lori Levin
- Bob Frederking
- Jeff Good (MPI Leipzig)
- Erik Peterson
2Avenue System Diagram
3Goals for Corpus Creation and Elicitation
- Parallel corpus with high quality word alignment
- For a language with little or no digitized language resources
- Use a bilingual informant with no linguistic expertise
4Outline
- Elicitation
- Feature Detection
- The Functional-Typological Corpus
- Corpus Creation and Elicitation
- Corpus Navigation
5The Elicitation Tool
6Input to the Elicitation Tool
- Eliciting from English
- 1,2,3 Sg,pl person pronouns
- newpair
- srcsent I sing
- context
- comment
- newpair
- srcsent I sang
- context
- comment
- newpair
- srcsent I am singing
- context
- comment
- newpair
- srcsent You sang
- Eliciting from Spanish
- 1,2,3 Sg,pl person pronouns
- newpair
- srcsent Canto
- context
- comment
- newpair
- srcsent Canté
- context
- comment
- newpair
- srcsent Estoy cantando
- context
- comment
- newpair
- srcsent Cantaste
7Output of the elicitation process
- newpair
- srcsent Tú caíste
- tgtsent eymi ütrünagimi
- aligned ((1,1),(2,2))
- context tú Juan masculino, 2a persona del singular
- comment You (John) fell
- newpair
- srcsent Tú estás cayendo
- tgtsent eymi petu ütünagimi
- aligned ((1,1),(2 3,2 3))
- context tú Juan masculino, 2a persona del singular
- comment You (John) are falling
- newpair
- srcsent Tú caíste
- tgtsent eymi, ütrunagimi
- aligned ((1,1),(2,2))
- context tú María femenino, 2a persona del singular
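The records above can be read mechanically. The following is a minimal Python sketch, assuming each field sits on its own line and begins with one of the keywords shown on the slide (`newpair`, `srcsent`, `tgtsent`, `aligned`, `context`, `comment`). The names `Pair`, `parse_alignment`, and `parse_records` are hypothetical and are not part of the AVENUE Elicitation Tool.

```python
import re
from dataclasses import dataclass, field

# Field keywords as they appear on the slide; anything else is ignored.
FIELDS = ("srcsent", "tgtsent", "aligned", "context", "comment")


@dataclass
class Pair:
    srcsent: str = ""
    tgtsent: str = ""
    aligned: list = field(default_factory=list)
    context: str = ""
    comment: str = ""


def parse_alignment(text):
    """Turn '((1,1),(2 3,2 3))' into [([1], [1]), ([2, 3], [2, 3])]."""
    links = []
    for src, tgt in re.findall(r"\(([\d\s]+),([\d\s]+)\)", text):
        links.append(([int(n) for n in src.split()],
                      [int(n) for n in tgt.split()]))
    return links


def parse_records(text):
    """Collect one Pair per 'newpair' line; later fields fill the latest Pair."""
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "newpair":
            pairs.append(Pair())
            continue
        key, _, value = line.partition(" ")
        key = key.rstrip(":")  # tolerate 'srcsent:' as well as 'srcsent'
        if key in FIELDS and pairs:
            value = value.strip()
            if key == "aligned":
                value = parse_alignment(value)
            setattr(pairs[-1], key, value)
    return pairs
```

Each alignment link is kept as a pair of index lists, so a many-to-many link such as `(2 3,2 3)` survives as `([2, 3], [2, 3])`.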
8Elicitation Corpus
- Elicitation Corpus refers to the list of sentences in the major language.
- Not yet translated or aligned
- Field workers call it a questionnaire.
9Feature Detection
- Identify meaning components that have morpho-syntactic consequences in the language that is being elicited.
- The gender of the subject is marked on the verb in Hebrew.
- The gender of the subject has no morpho-syntactic realization in Mapudungun.
10Feature detection feeds into
- Corpus Navigation: which minimal pairs to pursue next.
- Don't pursue gender in Mapudungun
- Do pursue definiteness in Hebrew
- Morphology Learning
- Morphological rule learner identifies the forms of the morphemes
- Feature detection identifies the functions
- Rule learning
- Rule learner will have to learn constraints corresponding to fact records.
- E.g., adjectives and nouns agree in gender, number, and definiteness in Hebrew.
11Other uses of Feature Detection
- A human-readable reference grammar can be generated from fact records.
- A human analyst knows Northern Ostyak, and then has to translate a document in Eastern Ostyak. The only reference grammar of Eastern Ostyak is written in Hungarian, which the analyst does not speak. An Eastern Ostyak consultant who speaks Russian translates the Elicitation Corpus from Russian to Eastern Ostyak. The analyst learns about Eastern Ostyak from the automatically generated fact records.
12Other uses of Feature Detection
- I'm not really sure whether the only grammar of Eastern Ostyak is written in Hungarian. There is one reference grammar of Northern Ostyak written in English. All other Ostyak materials are in Hungarian, Russian, and German.
- The Ostyaks are subsistence hunters, and Eastern Ostyak is nearly extinct, so there is no real need for government translators.
- Other Siberian and Central Asian languages with similar scarcity of resources may be important.
13Other uses of Feature Detection
- Help a field worker
- Instead of "elicit by day, analyze by night" (in order to know what to elicit the next day), go to sleep and look at the fact records in the morning.
- We have been working with people at EMELD and MPI Leipzig.
14Feature Detection Spanish
- The girl saw a red book.
- ((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
- La niña vió un libro rojo
- A girl saw a red book
- ((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
- Una niña vió un libro rojo
- I saw the red book
- ((1,1)(2,2)(3,3)(4,5)(5,4))
- Yo vi el libro rojo
- I saw a red book.
- ((1,1)(2,2)(3,3)(4,5)(5,4))
- Yo vi un libro rojo
- Feature definiteness
- Values definite, indefinite
- Function-of- subj, obj
- Marked-on-head-of- no
- Marked-on-dependent yes
- Marked-on-governor no
- Marked-on-other no
- Add/delete-word no
- Change-in-alignment no
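One part of a fact record like the one above, the Add/delete-word field, can be illustrated by comparing the two target sentences of a minimal pair. The Python sketch below is only an assumption about how such a comparison might look, not the AVENUE feature detector. The function name `compare_minimal_pair` is hypothetical, and the sketch does not decide whether a changed word is a head, dependent, or governor, which would need the alignments plus a source-side analysis.

```python
from difflib import SequenceMatcher


def compare_minimal_pair(target_a, target_b):
    """Report which target words differ between two minimal-pair translations.

    Returns the changed word pairs and a flag saying whether the difference
    involved adding or deleting words (as opposed to swapping one form for
    another in the same position).
    """
    a, b = target_a.split(), target_b.split()
    changed = []
    add_delete = False
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        if op == "replace" and (i2 - i1) == (j2 - j1):
            # Same number of words on both sides: a change of form.
            changed.extend(zip(a[i1:i2], b[j1:j2]))
        else:
            # Insertions, deletions, or unequal replacements.
            add_delete = True
            changed.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return {"changed": changed, "add_delete_word": add_delete}


# Spanish minimal pair from the slide: definite vs. indefinite subject.
result = compare_minimal_pair("La niña vió un libro rojo",
                              "Una niña vió un libro rojo")
print(result)
```

For the Spanish pair, the only difference is the article, with no word added or deleted, which matches the "Add/delete-word no" entry in the record.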
15Feature Detection Chinese
- A girl saw a red book.
- ((1,2)(2,2)(3,3)(3,4)(4,5)(5,6)(5,7)(6,8))
- [Chinese sentence lost in extraction]
- The girl saw a red book.
- ((1,1)(2,1)(3,3)(3,4)(4,5)(5,6)(6,7))
- [Chinese sentence lost in extraction]
- Feature definiteness
- Values definite, indefinite
- Function-of- subject
- Marked-on-head-of- no
- Marked-on-dependent no
- Marked-on-governor no
- Add/delete-word yes
- Change-in-alignment no
16Feature Detection Chinese
- I saw the red book
- ((1, 3)(2, 4)(2, 5)(4, 1)(5, 2))
- [Chinese sentence lost in extraction]
- I saw a red book.
- ((1,1)(2,2)(2,3)(2, 4)(4,5)(5,6))
- [Chinese sentence lost in extraction]
- Feature definiteness
- Values definite, indefinite
- Function-of- object
- Marked-on-head-of- no
- Marked-on-dependent no
- Marked-on-governor no
- Add/delete-word yes
- Change-in-alignment yes
17Feature Detection Hebrew
- A girl saw a red book.
- ((2,1) (3,2)(5,4)(6,3))
- [Hebrew sentence lost in extraction]
- The girl saw a red book
- ((1,1)(2,1)(3,2)(5,4)(6,3))
- [Hebrew sentence lost in extraction]
- I saw a red book.
- ((2,1)(4,3)(5,2))
- [Hebrew sentence lost in extraction]
- I saw the red book.
- ((2,1)(3,3)(3,4)(4,4)(5,3))
- [Hebrew sentence lost in extraction]
- Feature definiteness
- Values definite, indefinite
- Function-of- subj, obj
- Marked-on-head-of- yes
- Marked-on-dependent yes
- Marked-on-governor no
- Add-word no
- Change-in-alignment no
18AVENUE Elicitation Corpora
- The Functional-Typological Corpus
- Based on microtheories of meanings that may have morpho-syntactic realization
- The Structural Elicitation Corpus
- Based on sentence structures from the Penn TreeBank
19The Functional Typological Corpus
- </feature>
- <feature>
- <feature-name>c-my-polarity</feature-name>
- <value>
- <value-name>polarity-positive</value-name>
- </value>
- <value>
- <value-name>polarity-negative</value-name>
- </value>
- <note>Stick to the two obvious values of polarity for now.</note>
- </feature>
- Feature Name c-my-polarity
- Values positive, negative
- Note Stick to the two obvious values of polarity for now.
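The mapping from the XML entry to the human-readable summary can be sketched in Python with the standard library's `xml.etree.ElementTree`. The element names are taken from the slide. The function names `read_feature` and `summarize`, and the rule of dropping everything up to the first hyphen from a value name (so that `polarity-positive` prints as `positive`), are assumptions made for illustration. The project itself is described as using XSLT for this step.

```python
import xml.etree.ElementTree as ET

# One <feature> entry, element names as shown on the slide.
FEATURE_XML = """
<feature>
  <feature-name>c-my-polarity</feature-name>
  <value><value-name>polarity-positive</value-name></value>
  <value><value-name>polarity-negative</value-name></value>
  <note>Stick to the two obvious values of polarity for now.</note>
</feature>
"""


def read_feature(xml_text):
    """Return (feature name, list of value names, note) for one <feature>."""
    root = ET.fromstring(xml_text.strip())
    name = root.findtext("feature-name")
    values = [v.findtext("value-name") for v in root.findall("value")]
    note = root.findtext("note", default="")
    return name, values, note


def summarize(xml_text):
    """Format a feature the way the slide's plain-text summary does."""
    name, values, note = read_feature(xml_text)
    # Assumption: the short value label is whatever follows the first hyphen.
    short = [v.split("-", 1)[-1] for v in values]
    lines = [f"Feature Name {name}", "Values " + ", ".join(short)]
    if note:
        lines.append(f"Note {note}")
    return "\n".join(lines)


print(summarize(FEATURE_XML))
```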
20Functional Typological Corpus
- In XML
- XSLT scripts can format it into human-readable text or into data structures.
- Currently contains around 50 features and a few hundred values.
- Still under development.
21Functional Typological Corpus Representation of Who is at the meeting
- ((subj ((np-my-general-type pronoun-type)
- (np-my-person person-unk)
- (np-my-number num-sg)
- (np-my-animacy anim-human)
- (np-my-function fn-predicatee)
- (np-d-my-distance-from-speaker distance-neutral)
- (np-my-emphasis emph-no-emph)
- (np-my-info-function info-neutral)
- (np-pronoun-exclusivity exclusivity-n/a)
- (np-pronoun-antecedent-function antecedent-n/a)
- (np-pronoun-reflexivity reflexivity-n/a)))
- (predicate ((loc-roles loc-general-at)))
- Continued on next slide
22Continued Who is at the meeting
- (c-my-copula-type locative)
- (c-my-secondary-type secondary-copula)
- (c-my-polarity polarity-positive)
- (c-my-function fn-main-clause)
- (c-my-general-type open-question)
- (gap-function gap-copula-subject)
- (c-my-sp-act sp-act-request-information)
- (c-v-my-grammatical-aspect gram-aspect-neutral)
- (c-v-my-absolute-tense present)
- (c-v-my-phase-aspect durative)
- (c-my-headedness-rc rc-head-n/a)
- (c-my-minor-type minor-n/a)
- (c-my-restrictivess-rc rc-restrictive-n/a)
- (c-my-answer-type ans-n/a)
- (c-my-imperative-degree imp-degree-n/a)
- (c-my-actor's-status actor-neutral)
- (c-my-focus-rc focus-n/a)
- (c-my-gaps-function gap-n/a)
- (c-my-relative-tense relative-n/a)
- (c-my-ynq-type ynq-n/a)
- (c-my-actor's-sem-role actor-sem-role-neutral)
- (c-v-my-lexical-aspect state))
23Why is the corpus represented as a set of feature structures?
- Multiple elicitation languages
- Generate the English and Spanish elicitation corpora from the same internal representation
- Easy to add a new elicitation language
- Write a GenKit grammar to generate sentences from the same internal representation
24Why is the corpus represented as a set of feature structures?
- Feature structure represents things that are not expressed in the major language
- These things show up as comments in the elicitation corpus
- I am singing (comment female)
- May eventually use pictures and discourse context
- We actually want to elicit the meaning associated with the feature structure. English and Spanish are just vehicles for getting at the meaning.
25Corpus Creation Tools
- The elicitation corpus can be changed and new
corpora can be created.
26Motivation for Corpus Creation Tools
- Make new corpora easily
- Add a new tense (e.g., remote past) and automatically get all the combinations with other features
- Make a specialized corpus for a limited semantic domain or a specific language family
27Motivation for Corpus Creation Tools
- Combinatorics
- For example, all combinations of person, number, gender, tense, etc.
- Too much bookkeeping for a human corpus creator, and too time consuming
28Where do the feature structures come from?
- A linguist formulates a Multiply
- The multiply specifies a set of feature structures
29A Multiply
- ((subj ((np-my-general-type pronoun-type common-noun-type)
- (np-my-person person-first person-second person-third)
- (np-my-number num-sg num-pl)
- (np-my-biological-gender bio-gender-male bio-gender-female)
- (np-my-function fn-predicatee)))
- (predicate ((np-my-general-type common-noun-type)
- (np-my-definiteness definiteness-minus)
- (np-my-person person-third)
- (np-my-function predicate))) (c-my-copula-type role)
- (predicate ((adj-my-general-type quality-type))) (c-my-copula-type attributive)
- (predicate ((np-my-general-type common-noun-type)
- (np-my-person person-third)
- (np-my-definiteness definiteness-plus)
- (np-my-function predicate))) (c-my-copula-type identity)
- (c-my-secondary-type secondary-copula)
- (c-my-polarity all)
- (c-my-function fn-main-clause)
- (c-my-general-type declarative)
- (c-my-speech-act sp-act-state)
- (c-v-my-grammatical-aspect gram-aspect-neutral)
- (c-v-my-lexical-aspect state)
- (c-v-my-absolute-tense past present future)
- (c-v-my-phase-aspect durative))
- This multiply expands to 288 feature structures.
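The expansion of a multiply can be sketched as a Cartesian product over the features that list more than one value. The Python below is an illustration under stated assumptions, not the AVENUE tool: `expand` and `allowed` are hypothetical names, `(c-my-polarity all)` is taken to mean the two polarity values defined earlier, and the three predicate alternatives are represented only by their copula type. The raw product has 432 members. The exclusion in `allowed` (common nouns are always third person) is a guess that happens to reproduce the 288 on the slide. The slide itself does not say which exclusions apply.

```python
from itertools import product

# Multi-valued subject features from the multiply on the slide.
SUBJ = {
    "np-my-general-type": ["pronoun-type", "common-noun-type"],
    "np-my-person": ["person-first", "person-second", "person-third"],
    "np-my-number": ["num-sg", "num-pl"],
    "np-my-biological-gender": ["bio-gender-male", "bio-gender-female"],
}

# Multi-valued clause features; each copula type stands for one
# predicate alternative (role, attributive, identity).
CLAUSE = {
    "c-my-copula-type": ["role", "attributive", "identity"],
    "c-my-polarity": ["polarity-positive", "polarity-negative"],
    "c-v-my-absolute-tense": ["past", "present", "future"],
}


def expand(multiply):
    """Yield one flat feature structure (a dict) per combination of values."""
    names = list(multiply)
    for combo in product(*(multiply[n] for n in names)):
        yield dict(zip(names, combo))


def allowed(fs):
    """Hypothetical exclusion: a common noun cannot be first or second person."""
    return not (fs["np-my-general-type"] == "common-noun-type"
                and fs["np-my-person"] != "person-third")


all_fs = list(expand({**SUBJ, **CLAUSE}))
kept = [fs for fs in all_fs if allowed(fs)]
print(len(all_fs), len(kept))  # 432 288
```

Features with a single value, such as `(c-v-my-phase-aspect durative)`, do not change the count and are left out of the sketch.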
30There is a GUI for making Multiplies
- Demo available on request
31GenKit Grammar
- Use GenKit for generation
- declarative
- (<s> ==> (<np> <vp> <np> <sc>)
- (((x0 c-my-general-type) =c declarative)
- ((x2 verb-form) = fin)
- ((x3 c-my-copula-type) = (x0 c-my-copula-type))
- ((x4 d-speaker-gender) = (x0 d-speaker-gender))
- ((x4 d-hearer-gender) = (x0 d-hearer-gender))
- ((x4 d-my-formality) = (x0 d-my-formality))
- ((x3 np-my-number) = (x0 np-my-number))
- ((x3 np-my-animacy) = (x0 np-my-animacy))
- ((x3 np-my-biological-gender) = (x0 np-my-biological-gender))
- (x3 = (x0 predicate))
- (x1 = (x0 subj))
- (x2 = x0)))
32GenKit Lexicon
- Pronouns
- (word ((cat n) (root you) (pred pro) (np-my-person person-second) (np-my-animacy anim-human) (np-my-general-type pronoun-type)))
- (word ((cat n) (root I) (pred pro) (np-my-person person-first) (np-my-number num-sg) (np-my-animacy anim-human) (np-my-general-type pronoun-type)))
- (word ((cat n) (root we) (pred pro) (np-my-person person-first) (np-my-number num-pl) (np-my-animacy anim-human) (np-my-general-type pronoun-type)))
- (word ((cat n) (root we) (pred pro) (np-my-person person-first) (np-my-number num-dual) (np-my-animacy anim-human) (np-my-general-type pronoun-type)))
- (word ((cat n) (root she) (pred pro) (np-my-person person-third) (np-my-number num-sg) (np-my-biological-gender bio-gender-female) (np-my-animacy anim-human) (np-my-general-type pronoun-type)))
33Comments are also generated
- I one female sang
- Use comments for things that are not expressed in
English.
34Convert to Elicitation Format (input to Elicitation Tool)
- original WHO IS AT THE BOX
- full comment
- Sentence WHO IS AT THE BOX
- original I ONE-WOMAN AM PN_FEMALE ONE-WOMAN
- full comment NP1 ONE-WOMAN
- Sentence I AM PN_FEMALE
- original WILL I ONE-WOMAN BE THE TEACHER
- full comment NP1 ONE-WOMAN
- Sentence WILL I BE THE TEACHER
35Eight Basic Steps for Corpus Creation
- Write FVD and format into data structure
- Gather Exclusions (restrictions on co-occurrence of features)
- Design the Multiply
- Get a full set of Feature Structures
- Design Grammar and Comments
- Design Lexicon
- Generate Sentences from Feature Structures
- Convert to Elicitation Format
36Can make other types of corpora
- The Elicitation Corpus does not have to be
functional-typological
37Alternative Corpora The Medical Corpus
((subj ((body-parts all)
(Poss ((np-my-general-type pronoun-type)
(np-my-person all)
(np-my-number num-sg num-pl)
(np-my-animacy anim-human)
(np-my-use possessive)))
(Pred ((symptoms all))
(c-my-general-type declarative)
(c-my-speech-act sp-act-state)
(c-v-my-grammatical-aspect gram-aspect-neutral)
(c-v-my-lexical-aspect state)
(c-v-my-absolute-tense present))
- Feature Body-Parts Values
- part-hand Restrictions
- part-finger Restrictions
- part-tooth Restrictions symptom_redness, symptom_scratch, symptom_numbness, symptom_cut, symptom_lump, symptom_rash, symptom_puncture, symptom_bruise, symptom_frozen
- part-eye Restrictions symptom_rash
- part-arm Restrictions
The Result
- YOUR ARM IS RED
- YOUR ARM IS SCRATCHED
- YOUR ARM IS NUMB
- YOUR ARM IS NIL
- YOUR ARM HAS A/N INFECTION
38Corpus Navigation
- While the Elicitation Corpus for any one target language (TL) can be kept to a reasonable size, the universal Elicitation Corpus must check for all phenomena that might occur in any language.
- Since the universal corpus cannot be kept to a reasonable size, Corpus Navigation is necessary.
- Facts discovered about a particular TL early in the process constrain what needs to be looked for later in the process for that TL. Thus this is a dynamic process, different for each TL.
39Corpus Navigation search
- Navigation is a search process with the informant in the inner loop: the informant expands each search state, presented as a source-language (SL) sentence, by supplying the corresponding TL sentence and alignments.
- Analogously to game search, there is an "opening book" of moves (SL sentences to check for all languages), used until enough information has been gathered to make intelligent search choices.
- The heuristic function driving the search process is Relative Info Gain:
- $\mathrm{RIG}(Y \mid X) = \dfrac{H(Y) - H(Y \mid X)}{H(Y)}$
- The system reduces the remaining entropy in its knowledge of the language as much as possible.
- There should also be a cost factor, estimating the human effort required to expand the node.
- To make the process efficient enough, we will create "decision graphs", similar to RETE networks, that cache information so only the information that changes needs to be recomputed.
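The Relative Info Gain heuristic can be computed from paired observations. The Python sketch below uses the standard definition (entropy of Y, minus the conditional entropy of Y given X, divided by the entropy of Y). The function names and the toy data are assumptions for illustration. How AVENUE defines X and Y over search states is not stated on the slide.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(Y), in bits, of a list of observed labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def conditional_entropy(xs, ys):
    """H(Y | X): entropy of Y within each group of equal X, weighted by group size."""
    total = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def relative_info_gain(xs, ys):
    """RIG(Y | X) = (H(Y) - H(Y | X)) / H(Y); defined as 0.0 when H(Y) is 0."""
    h_y = entropy(ys)
    if h_y == 0:
        return 0.0
    return (h_y - conditional_entropy(xs, ys)) / h_y


# Toy data: X fully determines Y in the first call and tells us nothing
# about Y in the second.
xs = ["a", "a", "b", "b"]
print(relative_info_gain(xs, ["p", "p", "q", "q"]))
print(relative_info_gain(xs, ["p", "q", "p", "q"]))
```

A navigation step would then prefer the candidate sentence whose answer is expected to yield the highest gain, possibly discounted by the cost factor mentioned above.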