Title: NICE: Native Language Interpretation and Communication Environment
1NICE: Native Language Interpretation and
Communication Environment
- Lori Levin, Jaime Carbonell, Alon Lavie, Ralf
Brown, Erik Peterson, Katharina Probst, Rodolfo
Vega, Hal Daume
- Language Technologies Institute
- Carnegie Mellon University
- April 12, 2001
2NICE
- Rapid development of machine translation for low
and very low density languages
3Classification of MT by Language Density
- High density pairs (E-F, E-S, E-J, ...)
- Statistical or traditional MT approaches are O.K.
- Medium density (E-Czech, E-Croatian, ...)
- Example-based MT (success with Croatian, Korean)
- JHU initial success with stat-MT (Czech)
- Low density (S-Mapudungun, E-Iñupiaq, ...)
- 10,000 to 1 million speakers
- Insufficient bilingual corpora for SMT, EBMT
- Partial corpus-based resources
- Insufficient trained computational linguists
4Machine Translation of Very Low Density Languages
- No text in electronic form
- Can't apply current methods for statistical MT
- No standard spelling or orthography
- Few literate native speakers
- Few linguists familiar with the language
- Nobody is available to do rule-based MT
- Not enough money or time for years of linguistic
information gathering/analysis
- E.g., Siona (Colombia)
5Motivation for LDMT
- Methods developed for languages with very scarce
resources will generalize to all MT.
- Policy makers can get input from indigenous
people.
- E.g., Has there been an epidemic or a crop
failure?
- Indigenous people can participate in government,
education, and internet without losing their
language.
- First MT of polysynthetic languages
6New Ideas
- MT without large amounts of text and without
trained linguists
- Machine learning of rule-based MT
- Multi-Engine architecture can flexibly take
advantage of whatever resources are available.
- Research partnerships with indigenous communities
- (Future: Exponential models for data-miserly SMT)
7History of NICE
- Arose from a series of joint workshops of NSF and
OAS-CICAD.
- Workshop recommendations:
- Create multinational projects using information
technology to:
- provide immediate benefits to governments and
citizens
- develop critical infrastructure for communication
and collaborative research
- train researchers and engineers
- advance science and technology
8Approach
- Machine learning
- Uncontrolled corpus (Generalized Example-Based
MT)
- Controlled corpus elicited from native speakers
(Version Space Learning)
- Multi-Engine MT
- Flexibly adapt to whatever resources are
available
- Take advantage of the strengths of different MT
approaches
9Evaluation Objective
- To achieve a given level of translation quality
for a series of languages L1 to Ln:
- Reduce the amount of training data required
- Reduce the amount of language-specific
development time after language-independent
software has been developed
10Evaluation Baseline From Previous Work
(Generalized EBMT)
- High density languages (French, Spanish)
- 1 million word (1MW) parallel corpora (e.g., subset of Hansards)
- Consistent spelling, grammatically correct
- High coverage, gisting-quality translation
11Evaluation Baseline: GEBMT French Hansards
[Figure: coverage (in percent) as a function of corpus
size (in millions of words)]
12Long-Term Target: Reduction in Linguistic and
Human Resources
13Work Completed
14Establishing Partnerships
15NICE Partners
16NICE/Mapudungun: Current Products
- Writing conventions (Grafemario)
- Glossary Mapudungun/Spanish
- Bilingual newspaper, 4 issues
- Ultimas Familias memoirs
- Memorias de Pascual Coña
- 6 hours transcribed speech
- 40 hours recorded speech
17Instructible Rule-Based MT
18iRBMT: Instructible Rule-Based MT
19Elicitation Process
- Purpose: controlled elicitation of data that will
be input to machine learning of translation rules
20Elicitation Interface Example
21Elicitation Interface
- Native informant sees source language sentence
(in English or Spanish)
- Native informant types in translation, then uses
mouse to add word alignments
- Informant is:
- Literate
- Bilingual
- Not an expert in linguistics or computation
22The Learning Process
- Learning Instance:
- English: the big boy; Hebrew: ha-yeled ha-gadol
- Acquired Transfer Rule:
- Hebrew NP: N ADJ <=> English NP: the ADJ N
- where (Hebrew:N <=> English:N)
- (Hebrew:ADJ <=> English:ADJ)
- (Hebrew:N has ((def +)))
- (Hebrew:ADJ has ((def +)))
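
The acquired rule above can be pictured as a small data structure: a source pattern, a target template that reorders the aligned constituents, and feature constraints on the source side. The sketch below is only an illustration under stated assumptions. The names `TransferRule`, `Token`, `apply_rule`, and the toy lexicon are hypothetical and are not the NICE transfer-rule formalism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str
    pos: str
    features: frozenset = frozenset()


@dataclass
class TransferRule:
    source_pattern: tuple    # source constituent sequence, e.g. ("N", "ADJ")
    target_template: tuple   # literal target words or source positions (ints)
    constraints: dict        # source position -> features that must be present


def apply_rule(rule, tokens, lexicon):
    """Return the target word list, or None if the rule does not apply."""
    if tuple(t.pos for t in tokens) != rule.source_pattern:
        return None
    for position, required in rule.constraints.items():
        if not required <= tokens[position].features:
            return None
    output = []
    for item in rule.target_template:
        if isinstance(item, int):
            # Aligned constituent: translate the source word at that position.
            output.append(lexicon[tokens[item].word])
        else:
            # Literal target word inserted by the rule (here, "the").
            output.append(item)
    return output


# Hebrew NP: N ADJ <=> English NP: the ADJ N, both source words definite.
HEBREW_NP = TransferRule(
    source_pattern=("N", "ADJ"),
    target_template=("the", 1, 0),
    constraints={0: frozenset({"def"}), 1: frozenset({"def"})},
)

# Toy lexicon (assumption): maps the inflected source forms directly.
LEXICON = {"ha-yeled": "boy", "ha-gadol": "big"}

phrase = [
    Token("ha-yeled", "N", frozenset({"def"})),
    Token("ha-gadol", "ADJ", frozenset({"def"})),
]
result = apply_rule(HEBREW_NP, phrase, LEXICON)
print(" ".join(result))
```

The rule refuses to apply when a constraint fails, so a phrase whose noun and adjective are not marked definite would need a different rule.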
23Standard Version Space Learning
- Hypothesis Space of all possible rules consistent
with data seen so far
- Represented by a generalization lattice bounded
by S (most specific) and G (most general)
boundaries
- New positive instances (translation pairs)
generalize S
- New negative instances (incorrect translations)
specialize G
- Converge when S and G intersect
- Problem: worst-case exponential blow-up
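
A minimal sketch of the S/G update cycle described above, written as textbook candidate elimination over conjunctive attribute hypotheses. This is not the NICE learner. The attribute encoding (constituent order, noun definiteness, adjective definiteness) is an assumption made up for illustration, and the sketch does not handle inconsistent data.

```python
WILDCARD = "?"


def covers(hypothesis, instance):
    """True if the hypothesis accepts the instance."""
    return all(h == WILDCARD or h == x for h, x in zip(hypothesis, instance))


def more_general_or_equal(a, b):
    """True if hypothesis a accepts everything hypothesis b accepts."""
    return all(x == WILDCARD or x == y for x, y in zip(a, b))


def candidate_elimination(examples, domains):
    s = None                                   # most specific boundary
    g = [tuple(WILDCARD for _ in domains)]     # most general boundary
    for instance, is_positive in examples:
        if is_positive:
            # Positive instance: drop G members that reject it, generalize S.
            g = [h for h in g if covers(h, instance)]
            if s is None:
                s = tuple(instance)
            else:
                s = tuple(a if a == b else WILDCARD for a, b in zip(s, instance))
        else:
            # Negative instance: minimally specialize G members that accept it.
            new_g = []
            for h in g:
                if not covers(h, instance):
                    new_g.append(h)
                    continue
                for i, values in enumerate(domains):
                    if h[i] != WILDCARD:
                        continue
                    for value in values:
                        if value == instance[i]:
                            continue
                        candidate = h[:i] + (value,) + h[i + 1:]
                        if s is None or more_general_or_equal(candidate, s):
                            new_g.append(candidate)
            new_g = list(dict.fromkeys(new_g))  # remove duplicates, keep order
            # Keep only the maximally general hypotheses.
            g = [h for h in new_g
                 if not any(o != h and more_general_or_equal(o, h) for o in new_g)]
    return s, g


# Hypothetical attributes: (constituent order, noun definiteness, adjective definiteness)
DOMAINS = [("N-ADJ", "ADJ-N"), ("def", "indef"), ("def", "indef")]
EXAMPLES = [
    (("N-ADJ", "def", "def"), True),       # correct translation pair
    (("ADJ-N", "def", "def"), False),      # incorrect translation
    (("N-ADJ", "indef", "indef"), True),   # correct translation pair
]

s_boundary, g_boundary = candidate_elimination(EXAMPLES, DOMAINS)
print(s_boundary, g_boundary)
```

On this toy data both boundaries end at the same hypothesis, which is the convergence condition on the slide. With many attributes the G boundary can grow very large, which is the blow-up the last bullet refers to.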
24Locally-Constrained, Seeded Version Spaces
- Preferred generalization level (e.g.
parts of speech, linguistic features, semantic
features)
- First translation pair generalized to preferred
level => seed the VS
- Define P: max levels of seed generalization or
specialization (i.e. how close is the initial guess)
- Generate S/P and G/P boundaries, and apply VS
learning
- Allow mutation operator if S/P and G/P prove
incorrect
25Advantages of Seeded Version Spaces
- Worst case: polynomial with degree P =>
"tractable"
- Generalization level can be estimated reasonably
well for MT transfer rules => good seeds
- Faster convergence, requiring less training data
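
One way to see why bounding the search by P helps: if each slot of a rule can sit at one of a few abstraction levels, the full space grows exponentially with the number of slots, while the hypotheses within P steps of a seed grow only polynomially. The sketch below counts both. The level names and the encoding of a hypothesis as a vector of levels are assumptions for illustration, not the NICE representation.

```python
def full_space_size(num_slots, num_levels):
    """Number of level assignments when every slot is unconstrained."""
    return num_levels ** num_slots


def seeded_candidates(seed, num_levels, p):
    """All level vectors reachable from the seed in at most p single-level steps."""
    def extend(index, budget):
        if index == len(seed):
            yield ()
            return
        for level in range(num_levels):
            cost = abs(level - seed[index])
            if cost <= budget:
                # Spend part of the step budget on this slot, recurse on the rest.
                for rest in extend(index + 1, budget - cost):
                    yield (level,) + rest
    return list(extend(0, p))


# Hypothetical abstraction levels, from most specific to most general.
LEVELS = ("word", "part of speech", "any constituent")

# Seed: every slot of a three-slot rule starts at the part-of-speech level.
seed = (1, 1, 1)

near_seed = seeded_candidates(seed, len(LEVELS), p=1)
wider = seeded_candidates(seed, len(LEVELS), p=2)
print(len(near_seed), len(wider), full_space_size(len(seed), len(LEVELS)))
```

With a good seed, a small P keeps the candidate set small, which is the trade-off the slide describes: tractability depends on how close the initial guess is.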
26Version Space Abstraction Lattice
27The Elicitation Corpus
- List of sentences in a major language
- English
- Spanish
- Dynamically adaptable
- Different sentences are presented depending on
what was previously elicited
- Compositional
- Joe, Joe's brother, I saw Joe's brother, I told
you that I saw Joe's brother, etc.
- Aim for typological completeness
- Cover all types of languages
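
The compositional and dynamically adaptable properties can be sketched as a corpus whose items declare which simpler items must be elicited first. This is a hedged illustration only. `ElicitationItem`, `requires`, and `next_items` are hypothetical names, and the real corpus may select sentences by other criteria.

```python
from dataclasses import dataclass, field


@dataclass
class ElicitationItem:
    sentence: str
    requires: set = field(default_factory=set)  # sentences to elicit first


def next_items(corpus, elicited):
    """Sentences not yet elicited whose prerequisites have all been elicited."""
    return [item.sentence for item in corpus
            if item.sentence not in elicited and item.requires <= elicited]


# The compositional chain from the slide, smallest phrase first.
CORPUS = [
    ElicitationItem("Joe"),
    ElicitationItem("Joe's brother", {"Joe"}),
    ElicitationItem("I saw Joe's brother", {"Joe's brother"}),
    ElicitationItem("I told you that I saw Joe's brother",
                    {"I saw Joe's brother"}),
]

print(next_items(CORPUS, set()))
print(next_items(CORPUS, {"Joe"}))
```

Each larger sentence becomes available only after the phrase it is built from has been translated, so the learner always sees the parts before the whole.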
28Pilot Version of Elicitation Corpus
- Approximately 800 sentences
- Tested on Swahili
- Vocabulary
- Include a variety of semantic classes, e.g.,
animate, inanimate, man-made objects, natural
objects, etc.
- Noun phrases
- Detect number, gender, types of possessives,
classifiers, etc.
- Basic sentences
- Detect agreement between verb and subject and/or
object, basic word order, problems with
indefinite or inanimate subjects, etc.
- Complex constructions
- Currently relative clauses. Later, comparatives,
questions, embedded clauses, etc.
29Detection of Grammatical Features
- Each language uses a different inventory of
grammatical features: tense, number, person,
agreement.
Swahili: The hunter kill-ed the animal
Mwindaji a-li-mu-ua mnyama
a: class-one subject; li: past tense; mu: class-one
object; ua: kill
Fox (Algonquian):
Ne-waapam-aa-wa: I-see-direct-him
Ne-waapam-ek-wa: me-see-indirect-he
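
The two Fox forms above differ in a single morpheme slot, which suggests a simple diagnostic: compare elicited forms for a minimal pair of source sentences and report where they differ. The sketch below assumes the forms arrive already segmented with hyphens, as on the slide. The function name `differing_morphemes` is hypothetical, and real feature detection would need to cope with unsegmented input.

```python
def differing_morphemes(form_a, form_b, separator="-"):
    """Return (slot, morpheme_a, morpheme_b) for each slot where the forms differ.

    Returns None when the forms have different numbers of morphemes, since a
    slot-by-slot comparison is not meaningful in that case.
    """
    morphemes_a = form_a.split(separator)
    morphemes_b = form_b.split(separator)
    if len(morphemes_a) != len(morphemes_b):
        return None
    return [(slot, a, b)
            for slot, (a, b) in enumerate(zip(morphemes_a, morphemes_b))
            if a != b]


# Minimal pair from the slide: "I see him" versus "he sees me".
contrast = differing_morphemes("Ne-waapam-aa-wa", "Ne-waapam-ek-wa")
print(contrast)
```

Here the only difference is in the third slot (`aa` versus `ek`), matching the direct/indirect contrast in the glosses.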
30Organization of Tests
[Diagram: diagnostic tests, including Dual, Plural,
Paucal, and Subj-V Agr]
31Demo of Elicitation Interface and Feature
Detection
32Data Collection
33Mapudungun Data
- Spanish-Mapudungun parallel corpora
- Total words: 223,366
- Spanish-Mapudungun glossary
- About 5500 entries
- 40 hours of speech recorded
- 6 hours of speech transcribed
- Speech data will be translated into Spanish
34Progress and Plans
35Summary of Year 1: Partnerships
- Establishment of a partnership with the Institute
for Indigenous Studies at the Universidad de la
Frontera (UFRO) in Chile.
- Establishment of a partnership with the Chilean
Ministry of Education.
- Identified partners in Alaska and Colombia.
Details of the partnership are being discussed.
36Summary of Year 1: Data
- Spanish-Mapudungun parallel corpus: over 200,000
words
- Standardization of orthography: Linguists at UFRO
have evaluated the competing orthographies for
Mapudungun and written a report detailing their
recommendations for a standardized orthography
for NICE.
- Training for spoken language collection: In
January 2001 native speakers of Mapudungun were
trained in the recording and transcription of
spoken data.
- Mapudungun spoken language corpus: 40 hours
recorded, 6 hours transcribed (as of end of
February).
37Summary of Year 1: iKBMT
- Preliminary design of transfer rule formalism for
machine translation.
- Design and pilot testing of prototype elicitation
corpus.
- First prototype of feature detection.
- Morphological processing in PC-KIMMO covering
about 40 Mapudungun morphemes.
- Preliminary version of new parser for run-time
translation component.
38Goals for Year 2: Data
- Continue collection, transcription, and
translation of Mapudungun data.
- Take inventory of existing Inupiaq data available
from the Alaska Native Language Center and the
Inupiaq community.
- Focus on the North Slope dialect and other
dialects that are easily intelligible to North
Slope speakers. Type and record additional
Inupiaq data as needed.
- Plans for Siona data collection will be discussed
at a meeting in Bogota in May.
39Goals for Year 2: Elicitation Corpus
- Extend the elicitation corpus with more complex
constructions (such as causatives and
comparatives) and add diagnostics for complex
features such as the tense and aspect system.
- Refine elicitation interface based on preliminary
experiments.
- Preliminary user studies with the corpus and
interface using at least two languages.
- Refine the linguistic corpus so as to accelerate
learning of the more common and useful structures
first.
40Goals for Year 2: EBMT
- Baseline EBMT systems for Mapudungun and Inupiaq.
- Extend baseline systems with preliminary version
of linguistic generalization.
41Goals for Year 2: MT Run-time System
- Develop learnable transfer-rule structure and
interpreter.
- Unlike existing hand-coded transfer systems for
machine translation, a learnable structure
requires full compositionality and
component-wise generalizability/specializability
for data-driven inductive learning.
- Develop morphological processors and part of
speech taggers for Mapudungun and Spanish.
42Goals for Year 2: Version Space Learning
- Develop baseline Seeded-Version-Space (SVS)
inductive learning method
- Extend the elicitation interface to enable the
SVS system to generate questions for the native
informant, so as to speed the transfer-rule
learning process
43Future Projects
44Appendix
45The IEI Team
- Coordinator (leader of a bilingual and
multicultural education project)
- Distinguished native speaker
- Linguists (one native speaker, one near-native)
- Typists/Transcribers
- Recording assistants
- Translators
- Native speaker linguistic informants
46Agreement Between LTI and Institute of Indigenous
Studies (IEI), Universidad de la Frontera, Chile
- Contributions of IEI:
- Socio-linguistic knowledge
- Linguistic knowledge
- Experience in multicultural bilingual education
- The use of IEI facilities, faculty/researchers
and staff for the project
- Electronic network support and computer technical
support
47Agreement Between LTI and Institute of Indigenous
Studies (IEI), Universidad de la Frontera, Chile
- Contributions of LTI:
- Equipment: four computers and four DAT recorders
- Payment of consulting fees pending funding from
the Chilean Ministry of Education
- Expertise in language technologies
48LTI/IEI Agreement
- Cooperate in expanding the project to convergent
areas, such as bilingual education, as well as in
pursuing additional funding
49MINEDUC/IEI Agreement Highlights
- Based on the LTI/IEI agreement, the Chilean
Ministry of Education got involved in funding the
data collection and processing team for the year
2001. This agreement will be renewed each year,
as needed.
50MINEDUC/IEI Agreement
- Objectives:
- To evaluate the NICE/Mapudungun proposal for
orthography and spelling
- To collect an oral corpus that represents the four
Mapudungun dialects spoken in Chile. The main
domain is primary health, both traditional and
Western.
51MINEDUC/IEI Agreement
- Deliverables:
- An oral corpus of 800 hours recorded,
proportional to the demography of each currently
spoken dialect
- 120 hours transcribed and translated from
Mapudungun to Spanish
- A refined proposal for writing Mapudungun
52Mapudungun Morphology
- kudu.le.me.we.la.n
- lay_down.st.Hh.rem.neg.ind.1S
- I am not going to lay down there any more
- illku.faluw.kUle.n
- get_angry.SIM.ST.IND.1s
- I am pretending to be angry
- antU.kUdaw.kiaw.ke.rke.fu.y
- day.work.CIRC.CF.REP.IPD.IND.3s
- he used to work here and there as a day laborer,
I am told
- wisa.ka.dungu.fe.nge.y.mi
- bad.VERB.FAC.speak.NOM.VERB.IND.2s
- you are someone who always does and says nasty
things