An Overview of the AVENUE Project - PowerPoint PPT Presentation

An Overview of the AVENUE Project

Description:

My name is Lori. Transfer Rules. Direct: SMT, EBMT. AVENUE: ... Jaime Carbonell (PI), Alon Lavie (Co-PI), Lori Levin (Co-PI) Rule learning: Katharina Probst ... – PowerPoint PPT presentation

Slides: 83
Provided by: lsl
1
An Overview of the AVENUE Project
  • Presented by
  • Lori Levin
  • Language Technologies Institute
  • School of Computer Science
  • Carnegie Mellon University
  • Pittsburgh, PA USA

2
AVENUE Project
  • Dr. Jaime Carbonell, PI
  • Dr. Alon Lavie, Co-PI
  • Dr. Lori Levin, Co-PI
  • Dr. Robert Frederking
  • Dr. Ralf Brown
  • Dr. Rodolfo Vega
  • Mapudungun
  • Dr. Eliseo Cañulef
  • Rosendo Huisca
  • and others
  • Erik Peterson
  • Christian Monson
  • Ariadna Font Llitjós
  • Alison Alvarez
  • Roberto Aranovich
  • Dr. Jeff Good
  • Dr. Katharina Probst
  • Hebrew
  • Dr. Shuly Wintner
  • student

This research was funded in part by NSF grant
number IIS-0121-631.
3
MT Approaches
  • Interlingua: introduce-self
  • Semantic Analysis, Sentence Planning
  • Transfer Rules
  • Syntactic Parsing: Pronoun-acc-1-sg chiamare-1sg N
  • Text Generation: np poss-1sg name BE-pres N
  • AVENUE: Automate Rule Learning
  • Direct: SMT, EBMT
  • Source: Mi chiamo Lori
  • Target: My name is Lori
4
Approaches to MT
  • Direct
  • Works best with large parallel corpora
  • Millions of words
  • Can be done without linguistic resources
  • Interlingua
  • Useful when you are translating between more than
    two languages
  • Requires linguistic knowledge
  • Transfer
  • Requires linguistic knowledge

5
Useful Resources for MT
  • Parallel corpus
  • Monolingual corpus
  • Lexicon
  • Morphological Analyzer (lemmatizer)
  • Human Linguist
  • Human non-linguist

6
Low Resource Situations
  • Indigenous languages
  • May lack large corpora
  • May lack a computational linguist
  • Strategic Languages
  • Aside from standard written Arabic and Chinese
  • Resource-rich language limited domain
  • Most of the large parallel corpora are newspaper,
    parliamentary proceedings, or broadcast news
  • Fewer resources for conversation related to
    humanitarian aid.

7
Why Machine Translation for Languages with
Limited Resources?
  • We are in the age of information explosion
  • The internet + web + Google → anyone can get the
    information they want anytime
  • But what about the text in all those other
    languages?
  • How do they read all this English stuff?
  • How do we read all the stuff that they put
    online?
  • MT for these languages would enable:
  • Better government access to native indigenous and
    minority communities
  • Better minority and native community
    participation in information-rich activities
    (health care, education, government) without
    giving up their languages.
  • Civilian and military applications (disaster
    relief)
  • Language preservation

8
Mixed Resource Situations
  • Some resources are available and others aren't.

9
Omnivorous MT
  • Eat whatever resources are available
  • Eat large or small amounts of data

10
AVENUE's Inventory
  • Resources
  • Parallel corpus
  • Monolingual corpus
  • Lexicon
  • Morphological Analyzer (lemmatizer)
  • Human Linguist
  • Human non-linguist
  • Techniques
  • Rule based transfer system
  • Example Based MT
  • Morphology Learning
  • Rule Learning
  • Interactive Rule Refinement
  • Multi-Engine MT

11
The Avenue Low Resource Scenario
Elicitation
Rule Learning
Run-Time System
Rule Refinement
Morphology
Translation Correction Tool
Word-Aligned Parallel Corpus
INPUT TEXT
Run Time Transfer System
Rule Refinement Module
Elicitation Corpus
Decoder
Elicitation Tool
Lexical Resources
OUTPUT TEXT
15
AVENUE
  • Rules can be written by hand or learned
    automatically.
  • Hybrid
  • Rule-based transfer
  • Statistical decoder
  • Multi-engine combinations with SMT and EBMT

16
AVENUE systems (small and experimental, but
tested on unseen data)
  • Hebrew-to-English
  • Alon Lavie, Shuly Wintner, Katharina Probst
  • Hand-written and automatically learned
  • Automatic rules trained on 120 sentences perform
    slightly better than about 20 hand-written rules.
  • Hindi-to-English
  • Lavie, Peterson, Probst, Levin, Font, Cohen,
    Monson
  • Automatically learned
  • Performs better than SMT when training data is
    limited to 50K words

17
AVENUE systems (small and experimental, but
tested on unseen data)
  • English-to-Spanish
  • Ariadna Font Llitjós
  • Hand-written, automatically corrected
  • Mapudungun-to-Spanish
  • Roberto Aranovich and Christian Monson
  • Hand-written
  • Dutch-to-English
  • Simon Zwarts
  • Hand-written

18
The Avenue Low Resource Scenario
Elicitation
Rule Learning
Run-Time System
Rule Refinement
Morphology
Translation Correction Tool
Word-Aligned Parallel Corpus
INPUT TEXT
Run Time Transfer System
Rule Refinement Module
Elicitation Corpus
Decoder
Elicitation Tool
Lexical Resources
OUTPUT TEXT
19
Elicitation
  • Get data from someone who is
  • Bilingual
  • Literate
  • With consistent spelling
  • Not experienced with linguistics

20
English-Hindi Example
Elicitation Tool: Erik Peterson
21
English-Chinese Example
Note: The translator has to insert spaces between
words in Chinese.
22
English-Arabic Example
23
Purpose of Elicitation
  • srcsent: Tú caíste
  • tgtsent: eymi ütrünagimi
  • aligned: ((1,1),(2,2))
  • context: tú = Juan [masculino, 2a persona del
    singular]
  • comment: You (John) fell
  • srcsent: Tú estás cayendo
  • tgtsent: eymi petu ütünagimi
  • aligned: ((1,1),(2 3,2 3))
  • context: tú = Juan [masculino, 2a persona del
    singular]
  • comment: You (John) are falling
  • srcsent: Tú caíste
  • tgtsent: eymi ütrunagimi
  • aligned: ((1,1),(2,2))
  • context: tú = María [femenino, 2a persona del
    singular]
  • comment: You (Mary) fell
  • Provide a small but highly targeted corpus of
    hand aligned data
  • To support machine learning from a small data set
  • To discover basic word order
  • To discover how syntactic dependencies are
    expressed
  • To discover which grammatical meanings are
    reflected in the morphology or syntax of the
    language
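The elicitation records on this slide pair a source sentence with a target sentence and a hand alignment. The following is a minimal sketch of how such a record might be read, assuming the alignment string uses 1-based, space-separated word positions as in the examples above. The function names `parse_alignment` and `aligned_words` are hypothetical and not part of the AVENUE tools.

```python
import re

def parse_alignment(text):
    """Parse an alignment such as '((1,1),(2 3,2 3))' into index lists."""
    pairs = []
    # Each inner group is '(source positions,target positions)'.
    for src, tgt in re.findall(r"\(([\d ]+),([\d ]+)\)", text):
        pairs.append(([int(i) for i in src.split()],
                      [int(i) for i in tgt.split()]))
    return pairs

def aligned_words(srcsent, tgtsent, alignment):
    """Return (source phrase, target phrase) pairs for one elicited record."""
    src, tgt = srcsent.split(), tgtsent.split()
    return [(" ".join(src[i - 1] for i in s), " ".join(tgt[j - 1] for j in t))
            for s, t in parse_alignment(alignment)]

pairs = aligned_words("Tú estás cayendo", "eymi petu ütünagimi",
                      "((1,1),(2 3,2 3))")
print(pairs)
```

For the second record on the slide this yields one word-to-word link (Tú, eymi) and one two-to-two link covering the verb complex.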

24
Languages
  • The set of feature structures with English
    sentences has been delivered to the Linguistic
    Data Consortium as part of the Reflex program.
  • Translated (by LDC) into
  • Thai
  • Bengali
  • Plans to translate into
  • Seven strategic languages per year for five
    years.
  • As one small part of a language pack (BLARK) for
    each language.

25
Languages
  • Spanish version in progress at New Mexico State
    University (Helmreich and Cowie)
  • Plans to translate into Guarani
  • Portuguese version in progress in Brazil
    (Marcello Modesto)
  • Plans to translate into Karitiana
  • 200 speakers
  • Plans to translate into Inupiaq (Kaplan and
    MacLean)

26
Previous Elicitation Work
  • Pilot corpus
  • Around 900 sentences
  • No feature structures
  • Mapudungun
  • Two partial translations
  • Quechua
  • Three translations
  • Aymara
  • Seven translations
  • Hebrew
  • Hindi
  • Several translations
  • Dutch

27
The Avenue Low Resource Scenario
Elicitation
Rule Learning
Run-Time System
Rule Refinement
Morphology
Translation Correction Tool
Word-Aligned Parallel Corpus
INPUT TEXT
Run Time Transfer System
Rule Refinement Module
Elicitation Corpus
Decoder
Elicitation Tool
Lexical Resources
OUTPUT TEXT
28
AVENUE Machine Translation System
; SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)
  • Type information
  • Synchronous Context Free Rules
  • Alignments
  • x-side constraints
  • y-side constraints
  • xy-constraints,
  • e.g. ((Y1 AGR) = (X1 AGR))
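As a rough illustration of the rule components listed above, the sketch below stores the constituent sequences and alignments of the "the old man" rule and uses the alignments to build the target word order. It is not the AVENUE transfer engine: the `TransferRule` class, `apply_rule`, and the three-word lexicon are hypothetical, and feature constraints are ignored entirely.

```python
from dataclasses import dataclass

@dataclass
class TransferRule:
    src_type: str
    tgt_type: str
    src_seq: list      # x-side constituent sequence
    tgt_seq: list      # y-side constituent sequence
    alignments: list   # 1-based (x index, y index) pairs

def apply_rule(rule, tagged_words, lexicon):
    """Translate a list of (word, POS) pairs if it matches the x-side."""
    if [pos for _, pos in tagged_words] != rule.src_seq:
        return None  # rule does not apply to this input
    out = [None] * len(rule.tgt_seq)
    for x, y in rule.alignments:
        word, _ = tagged_words[x - 1]
        out[y - 1] = lexicon[word]
    return out

rule = TransferRule("NP", "NP", ["DET", "ADJ", "N"],
                    ["DET", "N", "DET", "ADJ"],
                    [(1, 1), (1, 3), (2, 4), (3, 2)])
lexicon = {"the": "ha", "old": "zaqen", "man": "ish"}  # toy entries, assumed
result = apply_rule(rule, [("the", "DET"), ("old", "ADJ"), ("man", "N")],
                    lexicon)
print(result)  # ['ha', 'ish', 'ha', 'zaqen']
```

Because X1 is aligned to both Y1 and Y3, the definite article is produced twice, which matches the target phrase ha-ish ha-zaqen.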

Jaime Carbonell (PI), Alon Lavie (Co-PI), Lori
Levin (Co-PI) Rule learning Katharina Probst
29
Rule Learning - Overview
  • Goal: acquire syntactic transfer rules
  • Use available knowledge from the major-language
    side (grammatical structure)
  • Three steps:
  • Flat Seed Generation: first guesses at transfer
    rules; flat syntactic structure
  • Compositionality Learning: use previously learned
    rules to learn hierarchical structure
  • Constraint Learning: refine rules by learning
    appropriate feature constraints

30
Flat Seed Rule Generation
31
Flat Seed Rule Generation
  • Create a flat transfer rule specific to the
    sentence pair, partially abstracted to POS
  • Words that are aligned word-to-word and have the
    same POS in both languages are generalized to
    their POS
  • Words that have complex alignments (or not the
    same POS) remain lexicalized
  • One seed rule for each translation example
  • No feature constraints associated with seed rules
    (but mark the example(s) from which it was
    learned)
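The seed-generation step described above can be sketched as follows, assuming each sentence is given as (word, POS) pairs and the alignment as 1-based word links. The function name `flat_seed_rule` and the example tags are hypothetical; this is an illustration of the idea, not the project's implementation.

```python
from collections import Counter

def flat_seed_rule(src, tgt, links):
    """Build a flat seed rule from one aligned, POS-tagged sentence pair.

    src, tgt: lists of (word, POS); links: 1-based (src index, tgt index).
    """
    # Start fully lexicalized; quoted items are literal words.
    src_seq = [f'"{w}"' for w, _ in src]
    tgt_seq = [f'"{w}"' for w, _ in tgt]
    src_count = Counter(s for s, _ in links)
    tgt_count = Counter(t for _, t in links)
    for s, t in links:
        one_to_one = src_count[s] == 1 and tgt_count[t] == 1
        # Generalize to POS only for word-to-word links with matching POS.
        if one_to_one and src[s - 1][1] == tgt[t - 1][1]:
            src_seq[s - 1] = src[s - 1][1]
            tgt_seq[t - 1] = tgt[t - 1][1]
    return src_seq, tgt_seq

seed = flat_seed_rule(
    [("a", "DET"), ("nice", "ADJ"), ("house", "N")],
    [("una", "DET"), ("casa", "N"), ("bonita", "ADJ")],
    [(1, 1), (2, 3), (3, 2)])
print(seed)  # (['DET', 'ADJ', 'N'], ['DET', 'N', 'ADJ'])
```

Words that take part in a one-to-many or many-to-many link stay quoted, which corresponds to the "remain lexicalized" case on the slide.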

32
Compositionality Learning
33
Compositionality Learning
  • Detection: traverse the c-structure of the
    English sentence, add compositional structure for
    translatable chunks
  • Generalization: adjust constituent sequences and
    alignments
  • Two implemented variants:
  • Safe Compositionality: there exists a transfer
    rule that correctly translates the
    sub-constituent
  • Maximal Compositionality: generalize the rule if
    supported by the alignments, even in the absence
    of an existing transfer rule for the
    sub-constituent

34
Constraint Learning
35
Constraint Learning
  • Goal: add appropriate feature constraints to the
    acquired rules
  • Methodology:
  • Preserve general structural transfer
  • Learn specific feature constraints from example
    set
  • Seed rules are grouped into clusters of similar
    transfer structure (type, constituent sequences,
    alignments)
  • Each cluster forms a version space: a partially
    ordered hypothesis space with a specific and a
    general boundary
  • The seed rules in a group form the specific
    boundary of a version space
  • The general boundary is the (implicit) transfer
    rule with the same type, constituent sequences,
    and alignments, but no feature constraints
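The clustering step can be sketched as below, assuming seed rules are stored as dictionaries with the fields named on the slide. The function `cluster_seed_rules` and the field names are hypothetical, and the actual search between the two boundaries is not shown.

```python
from collections import defaultdict

def cluster_seed_rules(seed_rules):
    """Group seed rules by type, constituent sequences, and alignments."""
    clusters = defaultdict(list)
    for rule in seed_rules:
        key = (rule["type"], tuple(rule["src_seq"]), tuple(rule["tgt_seq"]),
               tuple(sorted(rule["alignments"])))
        clusters[key].append(rule)
    spaces = []
    for (rtype, src_seq, tgt_seq, alignments), members in clusters.items():
        spaces.append({
            # Specific boundary: the seed rules themselves.
            "specific_boundary": members,
            # General boundary: same structure, no feature constraints.
            "general_boundary": {"type": rtype, "src_seq": list(src_seq),
                                 "tgt_seq": list(tgt_seq),
                                 "alignments": list(alignments),
                                 "constraints": []},
        })
    return spaces

seeds = [
    {"type": "NP::NP", "src_seq": ["DET", "ADJ", "N"],
     "tgt_seq": ["DET", "N", "ADJ"], "alignments": [(1, 1), (2, 3), (3, 2)],
     "example": 1},
    {"type": "NP::NP", "src_seq": ["DET", "ADJ", "N"],
     "tgt_seq": ["DET", "N", "ADJ"], "alignments": [(3, 2), (1, 1), (2, 3)],
     "example": 2},
    {"type": "NP::NP", "src_seq": ["DET", "N"],
     "tgt_seq": ["DET", "N"], "alignments": [(1, 1), (2, 2)],
     "example": 3},
]
spaces = cluster_seed_rules(seeds)
print(len(spaces))  # 2
```

The first two seed rules share one version space even though their alignments are listed in a different order, because the key sorts them.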

36
Transfer and Decoding
Elicitation
Rule Learning
Run-Time System
Rule Refinement
Morphology
Translation Correction Tool
Word-Aligned Parallel Corpus
INPUT TEXT
Run Time Transfer System
Rule Refinement Module
Elicitation Corpus
Decoder
Elicitation Tool
Lexical Resources
OUTPUT TEXT
37
The Transfer Engine
38
Symbolic Decoder
  • System rarely finds a full parse/transfer for
    complete input sentence
  • XFER engine produces comprehensive lattice of
    segment translations
  • Decoder selects best combination of translation
    segments
  • Search for optimal scoring path of partial
    translations, based on multiple features
  • Target Language Model scores
  • XFER Rule Scores
  • Path Fragmentation
  • Other features
  • Symbolic decoding essential for scenarios where
    there is insufficient data for training large
    target LM
  • Effective Rule Scoring is crucial
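To make the lattice search concrete, here is a minimal sketch of choosing the best-scoring sequence of segment translations by dynamic programming. It assumes each lattice arc carries a single combined score (higher is better) and models path fragmentation as a fixed penalty per segment. The function `decode`, the arc format, and all scores are hypothetical, and no language model is included.

```python
def decode(n, arcs, fragment_penalty=1.0):
    """Find the best path over input positions 0..n.

    arcs: (start, end, text, score) tuples with end > start.
    Returns (total score, list of segment translations) or None.
    """
    best = {0: (0.0, [])}
    for pos in range(n):
        if pos not in best:
            continue  # no partial translation reaches this position
        base, path = best[pos]
        for start, end, text, score in arcs:
            if start != pos:
                continue
            cand = base + score - fragment_penalty
            if end not in best or cand > best[end][0]:
                best[end] = (cand, path + [text])
    return best.get(n)

# Toy lattice over the three words "un gran artista".
arcs = [(0, 1, "a", -1.0), (1, 2, "great", -1.0), (2, 3, "artist", -1.0),
        (1, 3, "great artist", -1.5), (0, 3, "a great artist", -2.5)]
best_path = decode(3, arcs)
print(best_path)  # (-3.5, ['a great artist'])
```

With the per-segment penalty, the single full-span arc wins over paths stitched together from shorter segments, which is the effect the fragmentation feature is meant to have.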

39
The Avenue Low Resource Scenario
Elicitation
Rule Learning
Run-Time System
Rule Refinement
Morphology
Translation Correction Tool
Word-Aligned Parallel Corpus
INPUT TEXT
Run Time Transfer System
Rule Refinement Module
Elicitation Corpus
Decoder
Elicitation Tool
Lexical Resources
OUTPUT TEXT
40
Rule Refinement
Elicitation
Rule Learning
Run-Time System
Rule Refinement
Morphology
Translation Correction Tool
Word-Aligned Parallel Corpus
INPUT TEXT
Run Time Transfer System
Rule Refinement Module
Elicitation Corpus
Decoder
Elicitation Tool
Lexical Resources
OUTPUT TEXT
41
Interactive and Automatic Refinement of
Translation Rules
  • Problem: Improve Machine Translation quality.
  • Proposed Solution: Put bilingual speakers back
    into the loop; use their corrections to detect
    the source of the error and automatically improve
    the lexicon and the grammar.
  • Approach: Automate post-editing efforts by
    feeding them back into the MT system.
  • Automatic refinement of translation rules that
    caused an error, beyond post-editing.
  • Goal: Improve MT coverage and overall quality.

42
Technical Challenges
Automatic Evaluation of Refinement process
Elicit minimal MT information from non-expert
users
43
Error Typology for Automatic Rule Refinement
(simplified)
  • Missing word
  • Extra word
  • Wrong word order
  • Incorrect word
  • Wrong agreement

44
TCTool (Demo)
Interactive elicitation of error information
  • Add a word
  • Delete a word
  • Modify a word
  • Change word order

Actions
45
Types of Refinement Operations
Automatic Rule Adaptation
  • 1. Refine a translation rule
  • R0 → R1 (change R0 to make it more specific
    or more general)

R0: a nice house → una casa bonito (gender agreement error)
R1 (adds the constraint N gender = ADJ gender): a nice house → una casa bonita
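The effect of the refined rule R1 can be illustrated with a small agreement check, assuming a toy lexicon that records grammatical gender. The dictionary `GENDER` and the functions `agrees` and `filter_adjectives` are hypothetical and only show what the added constraint N gender = ADJ gender rules out.

```python
# Toy lexicon with grammatical gender; entries are assumed for illustration.
GENDER = {"casa": "f", "bonito": "m", "bonita": "f"}

def agrees(noun, adjective, lexicon=GENDER):
    """Check the constraint N gender = ADJ gender."""
    return lexicon[noun] == lexicon[adjective]

def filter_adjectives(noun, candidates, lexicon=GENDER):
    """Keep only adjective forms that satisfy the agreement constraint."""
    return [adj for adj in candidates if agrees(noun, adj, lexicon)]

kept = filter_adjectives("casa", ["bonito", "bonita"])
print(kept)  # ['bonita']
```

Without the constraint (rule R0) both adjective forms are possible outputs; with it (rule R1) only the agreeing form survives.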
46
Types of Refinement Operations
Automatic Rule Adaptation
  • 2. Bifurcate a translation rule
  • R0 → R0 (same, general rule)
  • → R1 (add a new, more specific rule)

R0: a nice house → una casa bonita
R1 (ADJ type = pre-nominal): a great artist → un gran artista
47
Automatic Rule Adaptation
A concrete example
Error Information Elicitation
  • Error type: Change word order
  • SL: Gaudí was a great artist
  • MT system output (TL): Gaudí era un artista grande
  • User correction: Gaudí era un artista grande → Gaudí era un gran artista
  • Diagram labels: error, correction, clue word
Refinement Operation Typology
48
Mapudungun
  • Indigenous Language of Chile and Argentina
  • 1 Million Mapuche Speakers

49
Mapudungun Language
  • 900,000 Mapuche people
  • At least 300,000 speakers of Mapudungun
  • Polysynthetic
  • sl: pe-rke-fi-ñ Maria
  • ver-REPORT-3pO-1pSgS/IND
  • tl: DICEN QUE LA VI A MARÍA
  • (They say that) I saw Maria.

50
AVENUE Mapudungun
  • Joint project between Carnegie Mellon University,
    the Chilean Ministry of Education, and
    Universidad de la Frontera.

51
Mapudungun to Spanish Resources
  • Initially
  • Large team of native speakers at Universidad de
    la Frontera, Temuco, Chile
  • Some knowledge of linguistics
  • No knowledge of computational linguistics
  • No corpus
  • A few short word lists
  • No morphological analyzer
  • Later: computational linguists with non-native
    knowledge of Mapudungun
  • Other considerations
  • Produce something that is useful to the
    community, especially for bilingual education
  • Experimental MT systems are not useful

52
Mapudungun
Corpus: 170 hours of spoken Mapudungun
Example Based MT
Spelling checker
Elicitation
Rule Learning
Run-Time System
Rule Refinement
Morphology
Translation Correction Tool
Word-Aligned Parallel Corpus
INPUT TEXT
Run Time Transfer System
Rule Refinement Module
Elicitation Corpus
Decoder
Elicitation Tool
Lexical Resources
OUTPUT TEXT
Spanish Morphology from UPC, Barcelona
53
Mapudungun Products
  • http://www.lenguasamerindias.org/
  • Click "traductor mapudungún"
  • Dictionary lookup (Mapudungun to Spanish)
  • Morphological analysis
  • Example Based MT (Mapudungun to Spanish)

54
I Didn't see Maria
[Figure: Mapudungun and Spanish parse trees with nodes S, VP, NP, V, VSuffG, VSuff, N. Mapudungun leaves: pe, la, fi, ñ, Maria. Spanish leaves: no, vi, a, María.]
55
Transfer to Spanish Top-Down
VP::VP [VBar NP] -> [VBar "a" NP]
(
(X1::Y1)
(X2::Y3)
((X2 type) = (*NOT* personal))
((X2 human) =c +)
(X0 = X1)
((X0 object) = X2)
(Y0 = X0)
((Y0 object) = (X0 object))
(Y1 = Y0)
(Y3 = (Y0 object))
((Y1 objmarker person) = (Y3 person))
((Y1 objmarker number) = (Y3 number))
((Y1 objmarker gender) = (Y3 gender))
)
[Figure: top-down transfer of the Mapudungun tree (leaves pe, la, fi, ñ, Maria) to a Spanish tree; the VP rule inserts "a" before the object NP.]
57
Collaboration
Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco,
Hector Painequeo, Flor Caniupil, Luis Caniupil
Huaiquiñir, Marcela Collio Calfunao, Cristian
Carrillan Anton, Salvador Cañulef
  • Mapuche Language Experts
  • Universidad de la Frontera (UFRO)
  • Instituto de Estudios Indígenas (IEI)
  • Institute for Indigenous Studies
  • Chilean Funding
  • Chilean Ministry of Education (Mineduc)
  • Bilingual and Multicultural Education Program

Carolina Huenchullan Arrúe, Claudio Millacura
Salas
58
Accomplishments
  • Corpora Collection
  • Spoken Corpus
  • Collected by Luis Caniupil Huaiquiñir
  • Medical Domain
  • 3 of 4 Mapudungun Dialects
  • 120 hours of Nguluche
  • 30 hours of Lafkenche
  • 20 hours of Pwenche
  • Transcribed in Mapudungun
  • Translated into Spanish
  • Written Corpus
  • 200,000 words
  • Bilingual Mapudungun Spanish
  • Historical and newspaper text

nmlch-nmjm1_x_0405_nmjm_00
M: <SPA>no pütokovilu kay ko
C: no, si me lo tomaba con agua
M: chumgechi pütokoki femuechi pütokon pu <Noise>
C: como se debe tomar, me lo tomé pués
nmlch-nmjm1_x_0406_nmlch_00
M: Chengewerkelafuymiürke
C: Ya no estabas como gente entonces!
59
Accomplishments
  • Developed At UFRO
  • Bilingual Dictionary with Examples
  • 1,926 entries
  • Spelling Corrected Mapudungun Word List
  • 117,003 fully-inflected word forms
  • Segmented Word List
  • 15,120 forms
  • Stems translated into Spanish

60
Accomplishments
  • Developed at LTI using Mapudungun language
    resources from UFRO
  • Spelling Checker
  • Integrated into OpenOffice
  • Hand-built Morphological Analyzer
  • Prototype Machine Translation Systems
  • Rule-Based
  • Example-Based
  • Website LenguasAmerindias.org

61
AVENUE Hebrew
  • Joint project of Carnegie Mellon University and
    University of Haifa

62
Hebrew Language
  • Native language of about 3-4 Million in Israel
  • Semitic language, closely related to Arabic and
    with similar linguistic properties
  • Root+Pattern word formation system
  • Rich verb and noun morphology
  • Particles attach as prefixes to the following
    word: definite article (H), prepositions
    (B,K,L,M), coordinating conjunction (W),
    relativizers (,K)
  • Unique alphabet and Writing System
  • 22 letters represent (mostly) consonants
  • Vowels represented (mostly) by diacritics
  • Modern texts omit the diacritic vowels, thus
    additional level of ambiguity: bare word → word
  • Example: MHGR → mehager, mhagar, mhger

63
Hebrew Resources
  • Morphological analyzer developed at Technion
  • Constructed our own Hebrew-to-English lexicon,
    based primarily on existing Dahan H-to-E and
    E-to-H dictionary
  • Human Computational Linguists
  • Native Speakers

64
Hebrew
Elicitation
Rule Learning
Run-Time System
Rule Refinement
Morphology
Translation Correction Tool
Word-Aligned Parallel Corpus
INPUT TEXT
Run Time Transfer System
Rule Refinement Module
Elicitation Corpus
Decoder
Elicitation Tool
Lexical Resources
OUTPUT TEXT
65
Flat Seed Rule Generation
66
Compositionality Learning
67
Constraint Learning
68
Challenges for Hebrew MT
  • Paucity in existing language resources for Hebrew
  • No publicly available broad coverage
    morphological analyzer
  • No publicly available bilingual lexicons or
    dictionaries
  • No POS-tagged corpus or parse tree-bank corpus
    for Hebrew
  • No large Hebrew/English parallel corpus
  • Scenario well suited for CMU transfer-based MT
    framework for languages with limited resources

69
Hebrew Morphology Example
  • Input word: BWRH
  • 0 1 2 3 4
  • --------BWRH--------
  • -----B-----WR--H--
  • --B---H----WRH---

70
Hebrew Morphology Example
  • Y0: ((SPANSTART 0) (SPANEND 4) (LEX BWRH) (POS N)
    (GEN F) (NUM S) (STATUS ABSOLUTE))
  • Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
  • Y2: ((SPANSTART 1) (SPANEND 3) (LEX WR) (POS N)
    (GEN M) (NUM S) (STATUS ABSOLUTE))
  • Y3: ((SPANSTART 3) (SPANEND 4) (LEX LH) (POS POSS))
  • Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
  • Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
  • Y6: ((SPANSTART 2) (SPANEND 4) (LEX WRH) (POS N)
    (GEN F) (NUM S))
  • Y7: ((SPANSTART 0) (SPANEND 4) (LEX BWRH) (POS LEX))
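The analyses Y0 to Y7 form a lattice over positions 0 to 4 of the input word BWRH. The sketch below enumerates every complete segmentation in that lattice, using the span and lexeme values exactly as they appear on the slide. The function name `segmentations` and the tuple layout are hypothetical.

```python
# (SPANSTART, SPANEND, LEX, POS) for Y0..Y7, copied from the slide.
ARCS = [
    (0, 4, "BWRH", "N"), (0, 2, "B", "PREP"), (1, 3, "WR", "N"),
    (3, 4, "LH", "POSS"), (0, 1, "B", "PREP"), (1, 2, "H", "DET"),
    (2, 4, "WRH", "N"), (0, 4, "BWRH", "LEX"),
]

def segmentations(arcs, start, end):
    """Return every arc sequence that covers positions start..end."""
    if start == end:
        return [[]]
    result = []
    for arc in arcs:
        if arc[0] == start:
            for rest in segmentations(arcs, arc[1], end):
                result.append([arc[2:]] + rest)
    return result

paths = segmentations(ARCS, 0, 4)
for path in paths:
    print(" + ".join(f"{lex}/{pos}" for lex, pos in path))
```

Each printed line is one candidate reading of the word; a downstream component still has to choose among them.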

71
Sample Output (dev-data)
  • maxwell anurpung comes from ghana for israel four
    years ago and since worked in cleaning in hotels
    in eilat
  • a few weeks ago announced if management club
    hotel that for him to leave israel according to
    the government instructions and immigration
    police
  • in a letter in broken english which spread among
    the foreign workers thanks to them hotel for
    their hard work and announced that will purchase
    for hm flight tickets for their countries from
    their money

72
Quechua→Spanish MT
  • V-Unit funded summer project in Cusco (Peru),
    June-August 2005; preparations and data
    collection started earlier
  • Intensive Quechua course in Centro Bartolome de
    las Casas (CBC)
  • Worked together with two Quechua native and one
    non-native speakers on developing infrastructure
    (correcting elicited translations, segmenting and
    translating list of most frequent words)

73
Quechua → Spanish Prototype MT System
  • Stem lexicon (semi-automatically generated): 753
    lexical entries
  • Suffix lexicon: 21 suffixes
  • (150 Cusihuaman)
  • Quechua morphology analyzer
  • 25 translation rules
  • Spanish morphology generation module
  • User-Studies 10 sentences, 3 users (2 native, 1
    non-native)

74
Quechua facts
  • Agglutinative language
  • A stem can often have 10 to 12 suffixes, but it
    can have up to 28 suffixes
  • Supposedly clear cut boundaries, but in reality
    several suffixes change when followed by certain
    other suffixes
  • No irregular verbs, nouns or adjectives
  • Does not mark for gender
  • No adjective agreement
  • No definite or indefinite articles (topic and
    focus markers perform a task similar to that of
    articles and intonation in English or Spanish)

75
Quechua examples
  • takini (also written takiniy)
  • sing 1sg (I sing) → canto
  • takishani (takishaniy)
  • sing progr 1sg (I am singing) → estoy
    cantando
  • takipakuqchu?
  • taki: sing
  • -paku: to join a group to do something
  • -q: agentive
  • -chu: interrogative
  • → ¿(para) cantar con la gente (del pueblo)?
  • (to sing with the people (of the village)?)
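The examples above can be segmented with a very small stem-plus-suffix splitter. This is a sketch only: the lexicons hold just the morphemes glossed on this slide, the function names are hypothetical, and it ignores the suffix alternations mentioned under "Quechua facts".

```python
STEMS = {"taki": "sing"}
SUFFIXES = {"sha": "progr", "ni": "1sg",
            "paku": "to join a group to do something",
            "q": "agentive", "chu": "interrogative"}

def split_suffixes(rest):
    """Split a suffix string into known suffixes, or return None."""
    if not rest:
        return []
    for suffix in SUFFIXES:
        if rest.startswith(suffix):
            tail = split_suffixes(rest[len(suffix):])
            if tail is not None:
                return [suffix] + tail
    return None

def segment(word):
    """Return [stem, suffix, ...] for a word, or None if unanalyzable."""
    for stem in STEMS:
        if word.startswith(stem):
            parts = split_suffixes(word[len(stem):])
            if parts is not None:
                return [stem] + parts
    return None

print(segment("takishani"))     # ['taki', 'sha', 'ni']
print(segment("takipakuqchu"))  # ['taki', 'paku', 'q', 'chu']
```

The alternative spelling takiniy is not analyzed by this sketch, since the toy suffix list has no entry for the final y.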

76
Quechua Resources
  • A few native speakers, not linguists
  • A computational linguist learning Quechua
  • Two fluent, but non-native linguists

77
Quechua
Parallel Corpus OCR with correction
Elicitation
Rule Learning
Run-Time System
Rule Refinement
Morphology
Translation Correction Tool
Word-Aligned Parallel Corpus
INPUT TEXT
Run Time Transfer System
Rule Refinement Module
Elicitation Corpus
Decoder
Elicitation Tool
Lexical Resources
OUTPUT TEXT
78
Grammar rules
  • takishani -> estoy cantando (I am singing)
  • {VBar,3}
  • VBar::VBar [V VSuff VSuff] -> [V V]
  • ( (X1::Y2)
  • ((x0 person) = (x3 person))
  • ((x0 number) = (x3 number))
  • ((x2 mood) =c ger)
  • ((y2 mood) = (x2 mood))
  • ((y1 form) =c estar)
  • ((y1 person) = (x3 person))
  • ((y1 number) = (x3 number))
  • ((y1 tense) = (x3 tense))
  • ((x0 tense) = (x3 tense))
  • ((y1 mood) = (x3 mood))
  • ((x3 inflected) =c +)
  • ((x0 inflected) = +))

Spanish Morphology Generation
  • lex = cantar, mood = ger -> cantando
  • lex = estar, person = 1, number = sg, tense = pres,
    mood = ind -> estoy
79
Hindi Resources
  • Large statistical lexicon from the Linguistic
    Data Consortium (LDC)
  • Parallel Corpus from LDC
  • Morphological Analyzer-Generator from LDC
  • Lots of native speakers
  • Computational linguists with little or no
    knowledge of Hindi
  • Experimented with the size of the parallel corpus
  • Miserly and large scenarios

80
Hindi
EBMT
Parallel Corpus
SMT
Elicitation
Rule Learning
Run-Time System
Rule Refinement
Morphology
Translation Correction Tool
Word-Aligned Parallel Corpus
INPUT TEXT
Run Time Transfer System
Rule Refinement Module
Elicitation Corpus
Decoder
Elicitation Tool
Lexical Resources
OUTPUT TEXT
15,000 Noun Phrases from Penn TreeBank
Supported by DARPA TIDES
81
Manual Transfer Rules Example
[Figure: Hindi parse tree (nodes NP, PP, NP1, P, Adj, N, N1) over "jIvana ke eka aXyAya" and English parse tree (nodes NP, NP1, PP, Adj, N, P, N1) over "one chapter of life".]
NP1 ke NP2 -> NP2 of NP1
Ex: jIvana ke eka aXyAya
    life of (one) chapter
    => a chapter of life
{NP,12}
NP::NP [PP NP1] -> [NP1 PP]
( (X1::Y2) (X2::Y1) ((x2 lexwx) = 'kA') )
{NP,13}
NP::NP [NP1] -> [NP1]
( (X1::Y1) )
{PP,12}
PP::PP [NP Postp] -> [Prep NP]
( (X1::Y2) (X2::Y1) )
82
Hindi-English
  • Very miserly training data
  • Seven combinations of components
  • Strong decoder allows re-ordering
  • Three automatic scoring metrics