NICE: Native Language Interpretation and Communication Environment

Description:

Rapid development of machine translation for low and very low ... kudu.le.me.we.la.n. lay_down.st.Hh.rem.neg.ind.1S. I am not going to lay down there any more ...

Slides: 53

Provided by: loril8

Learn more at: http://www.cs.cmu.edu

1
NICE: Native Language Interpretation and
Communication Environment

Lori Levin, Jaime Carbonell, Alon Lavie, Ralf
Brown, Erik Peterson, Katharina Probst, Rodolfo
Vega, Hal Daume
Language Technologies Institute
Carnegie Mellon University
April 12, 2001

2
NICE

Rapid development of machine translation for low
and very low density languages

3
Classification of MT by Language Density

High density pairs (E-F, E-S, E-J, ...)
  Statistical or traditional MT approaches are O.K.
Medium density (E-Czech, E-Croatian, ...)
  Example-based MT (success with Croatian, Korean)
  JHU initial success with stat-MT (Czech)
Low density (S-Mapudungun, E-Iñupiaq, ...)
  10,000 to 1 million speakers
  Insufficient bilingual corpora for SMT, EBMT
  Partial corpus-based resources
  Insufficient trained computational linguists

4
Machine Translation of Very Low Density Languages

No text in electronic form
  Can't apply current methods for statistical MT
No standard spelling or orthography
Few literate native speakers
Few linguists familiar with the language
  Nobody is available to do rule-based MT
Not enough money or time for years of linguistic information gathering/analysis
E.g., Siona (Colombia)

5
Motivation for LDMT

Methods developed for languages with very scarce
resources will generalize to all MT.
Policy makers can get input from indigenous
people.
E.g., has there been an epidemic or a crop
failure?
Indigenous people can participate in government,
education, and internet without losing their
language.
First MT of polysynthetic languages

6
New Ideas

MT without large amounts of text and without
trained linguists
Machine learning of rule-based MT
Multi-Engine architecture can flexibly take
advantage of whatever resources are available.
Research partnerships with indigenous communities
(Future: exponential models for data-miserly SMT)

7
History of NICE

Arose from a series of joint workshops of NSF and
OAS-CICAD.
Workshop recommendations:
Create multinational projects using information technology to
  provide immediate benefits to governments and citizens
  develop critical infrastructure for communication and collaborative research
  train researchers and engineers
  advance science and technology

8
Approach

Machine learning
  Uncontrolled corpus (Generalized Example-Based MT)
  Controlled corpus elicited from native speakers (Version Space Learning)
Multi-Engine MT
  Flexibly adapt to whatever resources are available
  Take advantage of the strengths of different MT approaches
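
One way to picture the multi-engine idea is a dispatcher that asks every engine that can handle the input and keeps the most confident answer. The sketch below is only an illustration: the two toy engines, their confidence values, and the names `example_engine`, `glossary_engine`, and `multi_engine_translate` are assumptions made here, not components of NICE.

```python
# Toy resources, invented for illustration only.
EXAMPLE_MEMORY = {"ha-yeled ha-gadol": "the big boy"}
GLOSSARY = {"yeled": "boy", "gadol": "big"}


def example_engine(sentence):
    """Return a stored translation with high confidence, or None if unseen."""
    if sentence in EXAMPLE_MEMORY:
        return EXAMPLE_MEMORY[sentence], 0.9
    return None


def glossary_engine(sentence):
    """Word-for-word fallback: low confidence, fails on unknown words."""
    words = [t[3:] if t.startswith("ha-") else t for t in sentence.split()]
    if not words or any(w not in GLOSSARY for w in words):
        return None
    return " ".join(GLOSSARY[w] for w in words), 0.3


ENGINES = {"example_memory": example_engine, "glossary": glossary_engine}


def multi_engine_translate(sentence, engines=ENGINES):
    """Ask every available engine and keep the most confident answer."""
    candidates = []
    for name, engine in engines.items():
        result = engine(sentence)
        if result is not None:
            text, confidence = result
            candidates.append((confidence, name, text))
    if not candidates:
        return None
    _, name, text = max(candidates)
    return name, text
```

Adding or removing an entry in `ENGINES` changes which resources are used without touching the dispatcher, which is the flexibility the slide points to.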

9
Evaluation Objective

To achieve a given level of translation quality for a series of languages L1 to Ln:
  Reduce the amount of training data required
  Reduce the amount of language-specific development time after language-independent software has been developed

10
Evaluation Baseline From Previous Work
(Generalized EBMT)

High density languages (French, Spanish)
1 MW (million-word) parallel corpora (e.g., subset of Hansards)
Consistent spelling, grammatically correct
High coverage, gisting-quality translation

11
Evaluation Baseline: GEBMT, French Hansards

[Figure: coverage (in percent) as a function of corpus size (in millions of words)]

12
Long-Term Target: Reduction in Linguistic and
Human Resources
13
Work Completed
14
Establishing Partnerships
15
NICE Partners
16
NICE/Mapudungun: Current Products

Writing conventions (Grafemario)
Glossary Mapudungun/Spanish
Bilingual newspaper, 4 issues
Ultimas Familias memoirs
Memorias de Pascual Coña
6 hours transcribed speech
40 hours recorded speech

17
Instructible Rule-Based MT
18
iRBMT: Instructible Rule-Based MT
19
Elicitation Process

Purpose: controlled elicitation of data that will
be input to machine learning of translation rules

20
Elicitation Interface Example
21
Elicitation Interface

Native informant sees source language sentence
(in English or Spanish)
Native informant types in translation, then uses
mouse to add word alignments
Informant is
  Literate
  Bilingual
  Not an expert in linguistics or computation
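
The record produced by one elicitation step can be sketched as a small data structure: the prompt, the typed translation, and the word alignments added with the mouse. This is a minimal sketch; the class name `ElicitedExample`, its fields, and the particular alignments in the usage lines are assumptions for illustration, not the project's actual format.

```python
from dataclasses import dataclass, field


@dataclass
class ElicitedExample:
    source: list                 # tokens of the prompt shown to the informant
    target: list                 # tokens of the informant's typed translation
    alignments: set = field(default_factory=set)  # (source index, target index) pairs

    def align(self, src_idx, tgt_idx):
        """Record that a source word corresponds to a target word."""
        if not 0 <= src_idx < len(self.source):
            raise IndexError("source index out of range")
        if not 0 <= tgt_idx < len(self.target):
            raise IndexError("target index out of range")
        self.alignments.add((src_idx, tgt_idx))

    def aligned_targets(self, src_idx):
        """Target words linked to one source word, in target order."""
        return [self.target[t] for s, t in sorted(self.alignments) if s == src_idx]

    def unaligned_source(self):
        """Source words the informant has not linked to anything yet."""
        linked = {s for s, _ in self.alignments}
        return [w for i, w in enumerate(self.source) if i not in linked]


example = ElicitedExample(["the", "big", "boy"], ["ha-yeled", "ha-gadol"])
example.align(0, 0)  # "the" is expressed on the noun ...
example.align(0, 1)  # ... and again on the adjective
example.align(1, 1)  # big -> ha-gadol
example.align(2, 0)  # boy -> ha-yeled
```

Allowing many-to-many links matters here because a single source word can surface on several target words.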

22
The Learning Process

Learning Instance
English: the big boy; Hebrew: ha-yeled ha-gadol
Acquired Transfer Rule
Hebrew NP: N ADJ <=> English NP: the ADJ N
where (Hebrew:N <=> English:N)
(Hebrew:ADJ <=> English:ADJ)
(Hebrew:N has ((def +)))
(Hebrew:ADJ has ((def +)))
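
To make the acquired rule concrete, here is a minimal sketch of applying it to the learning instance. It assumes a two-word toy lexicon and treats the prefix "ha-" on a romanized token as the definiteness feature; the names `LEXICON`, `analyze`, and `apply_np_rule` are hypothetical and not part of the NICE transfer formalism.

```python
# Toy lexicon (assumed for illustration): Hebrew stem -> (part of speech, English word)
LEXICON = {"yeled": ("N", "boy"), "gadol": ("ADJ", "big")}


def analyze(token):
    """Split a romanized Hebrew token into lexical entry and definiteness."""
    definite = token.startswith("ha-")
    stem = token[3:] if definite else token
    pos, english = LEXICON[stem]
    return {"pos": pos, "english": english, "def": definite}


def apply_np_rule(hebrew_np):
    """Hebrew NP: N ADJ <=> English NP: the ADJ N, requiring both words to be definite."""
    words = [analyze(token) for token in hebrew_np.split()]
    if [w["pos"] for w in words] != ["N", "ADJ"]:
        return None  # constituent order does not match the rule
    if not all(w["def"] for w in words):
        return None  # feature constraints on N and ADJ are not met
    noun, adjective = words
    return f"the {adjective['english']} {noun['english']}"
```

Calling `apply_np_rule("ha-yeled ha-gadol")` gives "the big boy", while the indefinite "yeled gadol" is rejected because the rule's definiteness constraints fail.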

23
Standard Version Space Learning

Hypothesis Space of all possible rules consistent
with data seen so far
Represented by a generalization lattice bounded
by S (most specific) and G (most general)
boundaries
New positive instances (translation pairs)
generalize S
New negative instances (incorrect translations)
specialize G
Converge when S and G intersect
Problem: worst-case exponential blow-up
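
The S and G updates described above can be sketched with the textbook candidate-elimination procedure for conjunctive hypotheses. The encoding below is an assumption made for illustration: each instance is a pair (noun definiteness, adjective definiteness) for a Hebrew N ADJ phrase, labeled by whether "the ADJ N" is a correct translation, and "?" means any value. It is not the NICE learner.

```python
ANY = "?"


def matches(h, x):
    """True if hypothesis h covers instance x."""
    return all(a == ANY or a == v for a, v in zip(h, x))


def more_general_or_equal(h1, h2):
    """True if h1 covers everything that h2 covers."""
    return all(a == ANY or a == b for a, b in zip(h1, h2))


def candidate_elimination(examples, domains):
    S = None                          # most specific boundary: covers nothing yet
    G = [tuple(ANY for _ in domains)]  # most general boundary: covers everything
    for x, positive in examples:
        if positive:
            # Generalize S just enough to cover x; drop G members that miss x.
            S = tuple(x) if S is None else tuple(
                s if s == v else ANY for s, v in zip(S, x))
            G = [g for g in G if matches(g, x)]
        else:
            # Specialize each G member that wrongly covers x.
            new_G = []
            for g in G:
                if not matches(g, x):
                    new_G.append(g)
                    continue
                for i, values in enumerate(domains):
                    if g[i] != ANY:
                        continue
                    for v in values:
                        cand = g[:i] + (v,) + g[i + 1:]
                        if v != x[i] and (S is None or more_general_or_equal(cand, S)):
                            new_G.append(cand)
            new_G = list(dict.fromkeys(new_G))  # remove duplicates, keep order
            # Keep only the maximally general candidates.
            G = [g for g in new_G
                 if not any(h != g and more_general_or_equal(h, g) for h in new_G)]
    return S, G


DOMAINS = [("+", "-"), ("+", "-")]
EXAMPLES = [
    (("+", "+"), True),   # ha-yeled ha-gadol: "the big boy" is correct
    (("-", "-"), False),  # yeled gadol: "the big boy" is incorrect
    (("+", "-"), False),
    (("-", "+"), False),
]
S, G = candidate_elimination(EXAMPLES, DOMAINS)
```

On this toy data the two boundaries meet at the hypothesis that both words are definite. With more attributes, G can hold many alternative specializations at once, which is where the blow-up comes from.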

24
Locally-Constrained, Seeded Version Spaces

Preferred generalization level (e.g., parts of speech, linguistic features, semantic features)
First translation pair generalized to preferred level => seed the VS
Define P: max levels of seed generalization or specialization (i.e., how close the initial guess is)
Generate S/P and G/P boundaries, and apply VS learning
Allow mutation operator if S/P and G/P prove incorrect
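
One possible reading of the seeded approach is to search only hypotheses that lie within P generalization or specialization steps of the seed. The sketch below follows that reading under stated assumptions: a hypothesis is a tuple of constraints, "?" is a wildcard, one step either relaxes a specific value to "?" or narrows "?" to a specific value, and the names `neighborhood` and `seeded_version_space` are hypothetical. It does not model the mutation operator.

```python
from itertools import combinations, product

ANY = "?"


def matches(h, x):
    """True if hypothesis h covers instance x."""
    return all(a == ANY or a == v for a, v in zip(h, x))


def neighborhood(seed, domains, P):
    """Hypotheses reachable from the seed by changing at most P positions."""
    result = set()
    for k in range(P + 1):
        for positions in combinations(range(len(seed)), k):
            # A specific value can only be relaxed to ANY (generalization);
            # ANY can be narrowed to any specific value (specialization).
            choices = [list(domains[i]) if seed[i] == ANY else [ANY]
                       for i in positions]
            for replacement in product(*choices):
                h = list(seed)
                for i, v in zip(positions, replacement):
                    h[i] = v
                result.add(tuple(h))
    return result


def seeded_version_space(seed, domains, P, examples):
    """Keep the nearby hypotheses that agree with every labeled example."""
    return {h for h in neighborhood(seed, domains, P)
            if all(matches(h, x) == label for x, label in examples)}


# Attributes (assumed): noun definiteness, adjective definiteness, number.
DOMAINS = [("+", "-"), ("+", "-"), ("sg", "pl")]
SEED = ("+", "?", "?")
EXAMPLES = [
    (("+", "+", "sg"), True),
    (("+", "+", "pl"), True),
    (("+", "-", "sg"), False),
]
nearby = neighborhood(SEED, DOMAINS, 1)
consistent = seeded_version_space(SEED, DOMAINS, 1, EXAMPLES)
```

Here the seed is one step too general, and the search over its six nearest neighbors settles on requiring a definite adjective while leaving number unconstrained.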

25
Advantages of Seeded Version Spaces

Worst case: polynomial with degree P => "tractable"
Generalization level can be estimated reasonably well for MT transfer rules => good seeds
Faster convergence, requiring less training data
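
As a rough illustration of the polynomial bound, assume (these symbols are not in the slides) that a rule has n constraint positions, each holding one of d specific values or a wildcard, and that one step changes a single position. The number of hypotheses within P steps of the seed is then at most

```latex
\[
\sum_{k=0}^{P} \binom{n}{k}\, d^{k} \;\le\; (P+1)\,(nd)^{P} \;=\; O\!\left((nd)^{P}\right)
\quad \text{for fixed } P,
\]
```

whereas the unrestricted conjunctive hypothesis space over the same positions has $(d+1)^{n}$ members, which grows exponentially in n.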

26
Version Space Abstraction Lattice
27
The Elicitation Corpus

List of sentences in a major language
  English
  Spanish
Dynamically adaptable
  Different sentences are presented depending on what was previously elicited
Compositional
  Joe, Joe's brother, I saw Joe's brother, I told you that I saw Joe's brother, etc.
Aim for typological completeness
  Cover all types of languages
28
Pilot Version of Elicitation Corpus

Approximately 800 sentences
Tested on Swahili
Vocabulary
  Include a variety of semantic classes, e.g., animate, inanimate, man-made objects, natural objects, etc.
Noun phrases
  Detect number, gender, types of possessives, classifiers, etc.
Basic sentences
  Detect agreement between verb and subject and/or object, basic word order, problems with indefinite or inanimate subjects, etc.
Complex constructions
  Currently relative clauses. Later, comparatives, questions, embedded clauses, etc.

29
Detection of Grammatical Features

Each language uses a different inventory of
grammatical features: tense, number, person,
agreement.

Swahili: The hunter kill-ed the animal
  Mwindaji a-li-mu-ua mnyama
  a: class-one subject; li: past tense; mu: class-one object; ua: kill
Fox (Algonquian):
  Ne-waapam-aa-wa: I-see-direct-him
  Ne-waapam-ek-wa: me-see-indirect-he
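
The glosses above suggest a simple minimal-pair check: when two elicited forms differ in exactly one morpheme, that morpheme is a candidate marker for the feature that differs between the prompts. The sketch below assumes forms are already segmented with hyphens, which an informant's raw typing would not be; the function names are hypothetical.

```python
# Morpheme glosses taken from the Swahili example on the slide.
SWAHILI_GLOSSES = {
    "a": "class-one subject",
    "li": "past tense",
    "mu": "class-one object",
    "ua": "kill",
}


def gloss(word, table):
    """Pair each hyphen-separated morpheme with its gloss ('?' if unknown)."""
    return [(morpheme, table.get(morpheme, "?")) for morpheme in word.split("-")]


def differing_morphemes(form_a, form_b):
    """Positions where two equally long segmented forms differ, else None."""
    a, b = form_a.split("-"), form_b.split("-")
    if len(a) != len(b):
        return None  # cannot compare position by position
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]
```

For the Fox pair, `differing_morphemes("Ne-waapam-aa-wa", "Ne-waapam-ek-wa")` isolates the third morpheme (aa versus ek), the slot glossed as direct versus indirect.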
30
Organization of Tests
Dual
Plural
Paucal
Diagnostic Tests

Subj-V Agr

31
Demo of Elicitation Interface and Feature
Detection
32
Data Collection
33
Mapudungun Data

Spanish-Mapudungun parallel corpora
  Total words: 223,366
Spanish-Mapudungun glossary
  About 5,500 entries
40 hours of speech recorded
6 hours of speech transcribed
Speech data will be translated into Spanish

34
Progress and Plans
35
Summary of Year 1: Partnerships

Establishment of a partnership with the Institute
for Indigenous Studies at the Universidad de la
Frontera (UFRO) in Chile.
Establishment of a partnership with the Chilean
Ministry of Education.
Identified partners in Alaska and Colombia.
Details of the partnership are being discussed.

36
Summary of Year 1: Data

Spanish-Mapudungun parallel corpus: over 200,000 words.
Standardization of orthography: Linguists at UFRO have evaluated the competing orthographies for Mapudungun and written a report detailing their recommendations for a standardized orthography for NICE.
Training for spoken language collection: In January 2001, native speakers of Mapudungun were trained in the recording and transcription of spoken data.
Mapudungun spoken language corpus: 40 hours recorded, 6 hours transcribed (as of end of February).

37
Summary of Year 1: iKBMT

Preliminary design of transfer rule formalism for
machine translation.
Design and pilot testing of prototype elicitation
corpus.
First prototype of feature detection.
Morphological processing in PC-KIMMO covering
about 40 Mapudungun morphemes.
Preliminary version of new parser for run-time
translation component.

38
Goals for Year 2: Data

Continue collection, transcription, and
translation of Mapudungun data.
Take inventory of existing Iñupiaq data available
from the Alaska Native Language Center and the
Iñupiaq community.
Focus on the North Slope dialect and other
dialects that are easily intelligible to North
Slope speakers. Type and record additional
Iñupiaq data as needed.
Plans for Siona data collection will be discussed
at a meeting in Bogota in May.

39
Goals for Year 2: Elicitation Corpus

Extend the elicitation corpus with more complex
constructions (such as causatives and
comparatives) and add diagnostics for complex
features such as the tense and aspect system.
Refine elicitation interface based on preliminary
experiments.
Preliminary user studies with the corpus and
interface using at least two languages.
Refine the linguistic corpus so as to accelerate
learning of the more common and useful structures
first.

40
Goals for Year 2: EBMT

Baseline EBMT systems for Mapudungun and Iñupiaq.
Extend baseline systems with preliminary version
of linguistic generalization.

41
Goals for Year 2: MT Run-time System

Develop learnable transfer-rule structure and
interpreter.
Unlike existing hand-coded transfer systems for
machine translation, a learnable structure
requires full compositionality and
component-wise generalizability/specializability
for data-driven inductive learning.
Develop morphological processors and part of
speech taggers for Mapudungun and Spanish.

42
Goals for Year 2: Version Space Learning

Develop baseline Seeded-Version-Space (SVS)
inductive learning method
Extend the elicitation interface to enable the
SVS system to generate questions for the native
informant, so as to speed the transfer-rule
learning process

43
Future Projects

Discussion

44
Appendix
45
The IEI Team

Coordinator (leader of a bilingual and
multicultural education project)
Distinguished native speaker
Linguists (one native speaker, one near-native)
Typists/Transcribers
Recording assistants
Translators
Native speaker linguistic informants

46
Agreement Between LTI and Institute of Indigenous
Studies (IEI), Universidad De La Frontera, Chile

Contributions of IEI
Socio-linguistic knowledge
Linguistic knowledge
Experience in multicultural bilingual education
The use of IEI facilities, faculty/researchers
and staff for the project
Electronic network support and computer technical
support

47
Agreement between LTI and Institute of Indigenous
Studies (IEI), Universidad de la Frontera, Chile

Contributions of LTI
Equipment: four computers and four DAT recorders
Payment of consulting fees pending funding from
the Chilean Ministry of Education
Expertise in language technologies

48
LTI/IEI Agreement

Cooperate in expanding the project to convergent
areas, such as bilingual education, as well as in
pursuing additional funding

49
MINEDUC/IEI Agreement: Highlights

Based on the LTI/IEI agreement, the Chilean
Ministry of Education became involved in funding the
data collection and processing team for the year
2001. This agreement will be renewed each year,
as needed.

50
MINEDUC/IEI Agreement

Objectives
To evaluate the NICE/Mapudungun proposal for
orthography and spelling
To collect an oral corpus that represents the four
Mapudungun dialects spoken in Chile. The main
domain is primary health, traditional and
Western.

51
MINEDUC/IEI Agreement

Deliverables
An oral corpus of 800 hours recorded,
proportional to the demography of each currently
spoken dialect
120 hours transcribed and translated from
Mapudungun to Spanish
A refined proposal for writing Mapudungun

52
Mapudungun Morphology