Automating machine translation from poorly studied languages

Provided by: johngol
Learn more at: https://hum.uchicago.edu

1
Automating machine translation from poorly
studied languages
  • John Goldsmith
  • Departments of Linguistics and Computer Science

2
Outline
  • The goal of automatic translation
  • The history of automatic translation
  • From the cybernetics era (1948-1960)
  • To the statistical era (1993 to date)
  • The problem of complex word structure in most
    languages
  • and our solution
  • Where we stand today

3
1. The goal of automatic translation
  • There are over 6,000 human languages in use
    today.
  • Researchers need access to documents in all of
    these languages over the long run.
  • Defense analysts may need access to documents in
    any of these languages with pressing time needs.
  • Languages can be intentionally used as encryption
    systems.

4
6,000 natural languages of the world
5
Just the major languages of Africa alone
6
Computational linguistic research
  • When we specify the problems that we tackle, they
    often sound superhumanly difficult.
  • When we begin to explain the methods, they can
    sound far too simple.
  • In fact, the methods are conceptually elegant,
    and highly quantitative.
  • The goals of linguistics, with the tools of
    computer science.

7
2. History of MT
  • A reminder of where computers actually came from
  • World War II, and their uses
  • more accurately aim artillery
  • efforts to break German encryption systems
  • Post-war period of industrialization

8
Warren Weaver's memo (July 1949)
  • Director, Natural Sciences Division of the
    Rockefeller Foundation
  • It is very tempting to say that a book written in
    Chinese is simply a book written in English which
    was coded into the "Chinese code." If we have
    useful methods for solving almost any
    cryptographic problem, may it not be that with
    proper interpretation we already have useful
    methods for translation?

9
Efforts in the 1950s
  • ...stymied by
  • the lack of sufficient computing power,
  • and immature computing technology

10
Example
They hadn't reckoned on ambiguity when they set
out to translate human languages.
January, 1954
11
  • Progress during the 1970s and 1980s was
    incremental.
  • In the 1990s, a major sea-change in computational
    linguistics occurred, based on data-driven
    statistical techniques.
  • IBM Research developed an approach to translation
    based on systems that learn from examples.

12
Statistical Machine Translation (MT)
13
1999: The Egypt system
  • NSF funded a summer project at Johns Hopkins
    University: Egypt.
  • Open source and widely used in research.
  • Difficult to use in practice.

14
What do we translate?
  • Do we translate sentences?
  • In a sense, yes.
  • L'entourage de Chirac est plus imperméable que
    celui de Nicolas Sarkozy.
  • Chirac's inner circle is more tightly knit than
    that of Nicolas Sarkozy.

15
Sentences = W + C
  • A sentence is a collection of words and
    constructions.
  • We translate the words and the constructions.
  • We will break the problem down into these two
    parts, then.

16
Word-level alignments
Given a parallel sentence pair, we can link
(align) words or phrases that are translations of
each other:
Le chien se est assis sur le tapis
the dog sat down on the rug
The system is given the two sentences, but without any
information about how the words are aligned:
the alignment lines are inferred, not given.
17
MT: first two tasks
  • Figure out word-to-word matchings (translations)
  • Figure out common alignments across the source
    and target languages: how their word orders
    differ.
  • French and English: quite similar
  • Japanese, Korean: verb appears at the end of the
    sentence.

18
Just a taste
  • This is our corpus

19
NULL?
We often find that a word in one language
corresponds to nothing in the other language, so
we include NULL as an ever-present possible
translation.
20
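English: NULL the dog
French: le chien
For $j = 1$ (le):
$\text{total} = P(\text{le} \mid \text{NULL}) + P(\text{le} \mid \text{the}) + P(\text{le} \mid \text{dog}) = 2/3 + 1/2 = 7/6 \approx 1.17$
tc = total count; $tc(a \mid b)$ = total expected count of
this joint occurrence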
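The bookkeeping on this slide (a per-word "total", then expected counts "tc") resembles the expectation step of an IBM Model 1 style word aligner. The slides do not name the model, so the following is only a minimal sketch under that assumption. The toy corpus, the function name `train`, and the helper `total_e` are hypothetical and are not taken from the presentation or from Egypt.

```python
from collections import defaultdict

NULL = "NULL"

# Hypothetical toy corpus of (English, French) pairs; not the slides' corpus.
corpus = [
    ("the dog".split(), "le chien".split()),
    ("the rug".split(), "le tapis".split()),
    ("dog".split(), "chien".split()),
]

def train(corpus, iterations=5):
    """Estimate P(french | english) with EM, assuming IBM Model 1."""
    # Prepend NULL so a French word may correspond to nothing in English.
    pairs = [([NULL] + e, f) for e, f in corpus]
    f_vocab = {w for _, f in pairs for w in f}
    # Initialized values: every French word equally likely for every English word.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        tc = defaultdict(float)       # tc[(f, e)]: expected count of f paired with e
        total_e = defaultdict(float)  # expected count of e paired with anything
        for e_sent, f_sent in pairs:
            for f in f_sent:
                # The slide's "total": sum of P(f | e) over every candidate e.
                total = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    share = t[(f, e)] / total
                    tc[(f, e)] += share
                    total_e[e] += share
        # Re-estimate the probabilities from the expected counts.
        t = defaultdict(float, {(f, e): c / total_e[e] for (f, e), c in tc.items()})
    return t

t = train(corpus)
for f, e in [("le", "the"), ("chien", "the"), ("chien", "dog"), ("le", "dog")]:
    print(f"P({f} | {e}) = {t[(f, e)]:.3f}")
```

Each pass splits every French word's probability mass among the English words in the same sentence, NULL included, and then renormalizes. Words that keep co-occurring across sentences accumulate the larger share.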
21
Changes in probabilities
Initialized values
Iteration 2
After 5 iterations
22
3. What is morphology?
  • Morphology studies the internal structure of
    words
  • English: words = word + s
  • findings = find + ing + s
  • Swahili: tunakisema, "we speak it"

23
European languages are outliers.
  • From the morphological point of view, most
    languages of the world are much more complex than
    European languages.

24
Linguistica
  • Computational linguistic project under
    development since 1997
  • http://linguistica.uchicago.edu
  • Core engine: automatic morphology analyzer
  • Learns the morphological structure of a language
    directly from a (written) sample, with no human
    intervention.

25
English illustration
  • Bear in mind the system has no initial knowledge
    at all about English.
  • It takes about 15 seconds to analyze 200,000
    words of English.
  • The C++ code is highly optimized and runs two
    orders of magnitude faster than comparable
    computational linguistic systems.

26
Signatures
  • Adjectives
  • Verbs
  • Nouns
We find these automatically.
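A signature, in Linguistica's sense, is the set of suffixes that a group of stems shares. The slides do not describe how signatures are discovered, so the following is only a crude illustration of the idea: it tries every stem/suffix cut and groups stems by the exact suffix set they occur with. The function name `signatures`, its thresholds, and the word list are hypothetical.

```python
from collections import defaultdict

def signatures(words, min_stem=3, min_stems=2):
    """Group candidate stems by the exact set of suffixes they appear with."""
    stem_suffixes = defaultdict(set)
    for w in set(words):
        # Try every cut point that leaves a stem of at least min_stem letters.
        for i in range(min_stem, len(w) + 1):
            stem, suffix = w[:i], w[i:]
            stem_suffixes[stem].add(suffix or "NULL")  # NULL marks the bare stem
    groups = defaultdict(set)
    for stem, sufs in stem_suffixes.items():
        if len(sufs) >= 2:  # a single suffix tells us nothing about structure
            groups[frozenset(sufs)].add(stem)
    # Keep only suffix sets supported by several stems.
    return {sig: stems for sig, stems in groups.items() if len(stems) >= min_stems}

# Hypothetical sample; the system itself would read a large raw text.
sample = ["jump", "jumps", "jumped", "jumping",
          "walk", "walks", "walked", "walking"]
for sig, stems in signatures(sample).items():
    print(sorted(sig), "->", sorted(stems))
```

On this sample the only suffix set shared by more than one stem is NULL, ed, ing, s, with the stems jump and walk, which is the kind of verb pattern the slide refers to. A real analyzer needs a principled way to choose among competing cuts, which this sketch does not attempt.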
27
Compounds
  • English makes heavy use of compounds, which are
    best handled if we can break them apart
  • Eastward
  • eggshell
  • farmhouse
  • headdress
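One simple way to break compounds like these apart is to look for a cut point where both halves are already known words. The sketch below assumes a hand-made lexicon, which is hypothetical; the slides do not say how the system obtains or scores its candidate splits.

```python
def split_compound(word, lexicon, min_part=3):
    """Return every (left, right) split in which both halves are known words."""
    word = word.lower()
    return [
        (word[:i], word[i:])
        for i in range(min_part, len(word) - min_part + 1)
        if word[:i] in lexicon and word[i:] in lexicon
    ]

# Hypothetical lexicon; in practice it would be learned from the corpus.
lexicon = {"east", "ward", "egg", "shell", "farm", "house", "head", "dress"}

for w in ["Eastward", "eggshell", "farmhouse", "headdress"]:
    print(w, split_compound(w, lexicon))
```

Returning every valid split rather than the first one leaves room for a later step that selects one candidate and rejects the others.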

28
Compounds
Selected
Rejected
Rejected
29
4. Where we stand today
  • Our project is working on
  • improving automatic morphology
  • integrating Egypt statistical machine translation
    into our package for easy application
  • improving translation by using morphology
  • testing with Swahili-English

30
4.1 Improving automatic morphology
  • Swahili, Somali, Urdu, Finnish
  • Compounds: English, Finnish

31
Swahili
nilimupenda nitakamupenda
33
Swahili verb
34
4.2 Integrating Egypt MT software into our
front-end
  • Linguistica has a user-friendly front end
  • Linguistica is written in C++ and compiles under
    Windows, MacOS, and Linux
  • Open source

35
4.3 Improving translations using morphology
  • Developing mathematical models
  • A small amount of work has been done by other
    researchers, but the goal has largely been to use
    morphology to strip off affixes.
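The affix-stripping approach mentioned above can be pictured as a preprocessing step that runs before alignment. The following is a minimal sketch, assuming a fixed English suffix list; the list, the `+` marker, and the function name `segment` are hypothetical and do not come from the presentation.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical suffix list, for illustration only

def segment(word, suffixes=SUFFIXES, min_stem=3):
    """Peel known suffixes off the right edge, keeping each as its own token."""
    peeled = []
    stripped = True
    while stripped:
        stripped = False
        for suf in suffixes:
            # Only strip if a stem of at least min_stem letters remains.
            if word.endswith(suf) and len(word) - len(suf) >= min_stem:
                peeled.insert(0, "+" + suf)
                word = word[: -len(suf)]
                stripped = True
                break
    return [word] + peeled

print(segment("findings"))
print(segment("dogs"))
print(segment("dog"))
```

Dropping the `+` tokens gives plain affix stripping, which makes "finding" and "findings" count as the same word during alignment. Keeping them as tokens preserves the information the affixes carry, at the cost of longer sentences.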

36
4.4 Testing with Swahili
  • 8 books from the New Testament available on the
    internet.

37
the end