Semitic Languages, Linguistics and Computers

Transcript and Presenter's Notes

Title: Semitic Languages, Linguistics and Computers


1
Semitic Languages, Linguistics and Computers
Kenneth R. BEESLEY
Xerox Research Centre Europe (XRCE)
ken.beesley_at_xrce.xerox.com
University of Malta, March 2001
2
Ken Beesley: Brief Introduction
  • B.A., Linguistics and Computer Science, Brigham
    Young University, 1978
  • Diploma, Linguistics and Phonetics, Univ. of
    Glasgow, 1979
  • D.Phil., Epistemics (Cognitive Science), Univ.
    of Edinburgh, 1983
  • ALPNET, computer assisted translation, 1984-1990
  • 1988-1990 Arabic morphology project, exposure to
    Finite-State Morphology (Two-Level Morphology)
    from Lauri Karttunen at COLING 1988
  • Microlytics (Xerox spinoff), 1990-1993
  • Xerox Corporation 1993-present
  • Computational Morphology projects: Arabic,
    Spanish, Portuguese, Italian, Dutch, (Malay),
    (Aymara); also teaching finite-state programming
    techniques
  • Some people are into finite-state programming for
    the mathematics and algorithms; I'm in it because
    it lets me build working systems for interesting
    natural languages.

3
Overview of Today's Talk
  • Formal Morphology
  • Morphotactics: study and description of word
    formation
  • Morphophonology: study and description of
    alternations
  • Challenges/Issues in Semitic morphology
  • Computational Morphology (Finite-State Morphology
    paradigm)
  • General challenges/successes around the world
  • Semitic languages always seem to be a bit harder
  • Significant computational work already done on
    Semitic languages
  • Hope to inspire more

4
Concatenative-Polysynthetic (Inuktitut)
  • Lexical: natsiqviniqtuqlauqsimavitli
  • Surface: natsi viniq tu lauq si ma vi l li
  • natsiq: seal (open-class stem)
  • viniq: meat (closed-class substem)
  • tuq: eat (closed-class substem)
  • lauq: before
  • si: perfective
  • ma: resulting state
  • vi: question marker
  • t: you
  • li: but

(but) have you ever eaten seal meat before?
5
Inuktitut
  • Lexical: Parismutnngaujumaniraqlauqsimanngitjunga
  • Surface: Pari mu nngau juma nira lauq si ma nngit tunga
  • Paris: Paris
  • mut: terminalis-case
  • nngau: direction-to
  • juma: want
  • niraq: declare that
  • lauq: past
  • si: perfective
  • ma: resulting state
  • nngit: negative
  • junga: 1P pres. indic

I never said that I wanted to go to Paris
6
Concatenative-Agglutinative (Aymara)
  • Lexical: utamana-kapxaraki-iwa
  • Surface: uta ma n ka p xa rak i wa
  • uta: house (noun stem)
  • ma: 2nd person possessive (your)
  • na: in (case suffix)
  • -ka: locative (also verbalizes)
  • p: plural
  • xa: perfect aspect
  • raki: also
  • -i: 3rd person present tense
  • wa: affirmative sentential
  • Translation: also they are in your house

7
Aymara
  • Morphophonemic: Chuñu wi na -ka si -ka -iri yat(a) wa
  • Surface: Chuñü wi n ka s k irï yät wa
  • chuñu: N, freeze-dried potatoes
  • N>V be/make
  • wi: V>N, place-of
  • na: in (location)
  • -ka: N>V, be-in (location)
  • si: continuative
  • -ka: imperfect
  • -iri: V>N, one who
  • N>V be
  • yata: 1P recent past
  • wa: affirmative sentential
  • Translation: I was (one who was) always at the
    place for making chuñu

8
Theory-Neutral Morphological Analysis
Analysis undoes the morphotactic and
morphophonological processes, separating and
identifying the morphemes.
Generation is ideally just the inverse of
analysis.
[Diagram: Words (e.g. Chuñüwinkaskirïyätwa) <-> Black-Box Morphological Analyzer <-> Analyses]
9
The Claim/Goal of Xerox Finite-State Morphology
  • Both the morphotactics and the morphophonological
    alternations can be described with regular
    expressions, or equivalent shorthand notations,
    which are compiled into finite-state transducers
    (networks)

[Diagram: the Morphotactic Description (regular expression or
lexc) and the Alternation Rules (regular expression) are each
compiled into an FST; the two FSTs are combined via composition
(.o.) at compile-time into a single Lexical Transducer.]
10
A Properly Defined Lexical Transducer FST can
Perform Morphological Analysis and Generation
  • bidirectional
  • same network for both analysis and generation
  • efficient
  • process thousands of words/second
  • compact
  • less than 1MB in compressed form

[Diagram: vouloir+IndP+SG+P3 (canonical form plus inflection
codes) <-> finite-state network <-> veut (inflected form)]
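The composition and bidirectional lookup described on the last two slides can be sketched with finite sets of string pairs standing in for transducers. This is only an illustration: real finite-state networks can encode infinite relations, the function names are invented here, and the upper-side analysis string is a hypothetical notation; only the two Aymara strings come from the slides.

```python
def compose(upper_rel, lower_rel):
    """Compose two finite relations given as sets of (upper, lower) pairs."""
    return {(a, c) for (a, b) in upper_rel for (b2, c) in lower_rel if b == b2}

def apply_down(relation, upper):
    """Generation: map an upper-side string to its lower-side strings."""
    return sorted(low for (up, low) in relation if up == upper)

def apply_up(relation, lower):
    """Analysis: map a lower-side string to its upper-side strings."""
    return sorted(up for (up, low) in relation if low == lower)

# Hypothetical analysis notation (an assumption, not the Xerox tag set).
ANALYSIS = "uta+ma+na+ka+p+xa+raki+i+wa"

# Stand-in for the morphotactic FST: analysis <-> morphophonemic string.
morphotactics = {(ANALYSIS, "utamana-kapxaraki-iwa")}
# Stand-in for the alternation-rule FST: morphophonemic <-> surface string.
alternations = {("utamana-kapxaraki-iwa", "utamankapxarakiwa")}

# Composed once, ahead of time; the same object serves both directions.
lexical_transducer = compose(morphotactics, alternations)

print(apply_down(lexical_transducer, ANALYSIS))
print(apply_up(lexical_transducer, "utamankapxarakiwa"))
```

The intermediate string disappears from the composed relation, which is why a single network suffices at runtime.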
11
Why is finite-state power interesting?
  • Formally constrained (not just a bunch of ad hoc
    code)
  • Flexible: grammars compile into finite-state
    automata (networks) that can themselves be
    combined and modified without needing to change
    the original grammar
  • Networks provide efficient storage
  • Networks can be applied very efficiently:
    morphological analyzers typically run at thousands
    of words per second on modern machines
  • Networks are bi-directional
  • The application code is language-independent

12
Some Aymara alternation rules
  • a -> ä, i -> ï, u -> ü _
  • a i u -> 0 _ -
  • c h i - -> s _ t s
  • s t ä (->) t ä s k i _ t a

You can see and download the set of real Aymara
alternation rules at
http://www.xrce.xerox.com/research/mltt/aymara
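As a rough illustration of one such rule, the sketch below applies a vowel-deletion rule to the lexical string from the earlier Aymara slide. The reading of the rule is an assumption: a vowel (a, i or u) immediately before a hyphen-marked suffix is deleted, and the hyphen itself is dropped. The function name is hypothetical, and a Python regular-expression substitution stands in for a compiled rule transducer.

```python
import re

def delete_vowel_before_hyphen(lexical: str) -> str:
    """Assumed reading of "a i u -> 0 _ -": drop a vowel that directly
    precedes a hyphen, together with the hyphen marker itself."""
    return re.sub(r"[aiu]-", "", lexical)

# Lexical string taken from the Aymara example earlier in the talk.
print(delete_vowel_before_hyphen("utamana-kapxaraki-iwa"))
```

Under this reading the output is the surface form given on the earlier slide, utamankapxarakiwa.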
13
Finite-State Morphology
  • Software Implementations, Development Environments
  • Two-Level Morphology (e.g. PC-KIMMO)
  • Xerox Finite-State Morphology (lexc, xfst, twolc,
    ...)
  • AT&T Library, Lextools
  • Univ. of Groningen, Fsa Utils 6
  • Morphological Applications
  • All the commercially interesting Indo-European
    languages
  • Also Finnish, Hungarian, Turkish, Swahili,
    Korean, Japanese
  • Significant research in Irish, Basque, Malay,
    Aymara, ...

14
Criticism of Traditional Finite-State
Morphotactics
  • Two-Level and Finite-State Morphology in general
    have been widely criticized for handling only
    concatenative morphotactics.
  • "Only restricted infixation and reduplication can
    be handled adequately with the present system.
    Some extensions or revisions will be necessary
    for an adequate description of languages
    possessing extensive infixation or
    reduplication." (Koskenniemi, 1983, p. 27)
  • In particular, it is often charged that
    finite-state morphology is not capable of
    handling Semitic languages.

15
The Challenge of Fixed-length Reduplication in
Tagalog (Antworth 1990:156-162)
  • pili (choose) > pipili
  • tahi (sew) > tatahi
  • kuha (take) > kukuha
  • Antworth defines a morphophonemic lexical prefix
    RE plus alternation rules that realize R as the
    first following consonant and E as the first
    following vowel.
  • Lexical: REpili, REtahi, REkuha
  • Surface: pipili, tatahi, kukuha
  • This solution is adequate and even elegant for
    such fixed-length reduplication.
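The effect of the RE prefix can be mimicked procedurally. The sketch below is not Antworth's two-level rule set; it is a plain Python function (hypothetical name, assumed vowel inventory) that realizes R as the first consonant of the stem and E as its first vowel.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for this illustration

def realize_reduplication(lexical: str) -> str:
    """Realize a leading RE prefix: R copies the first following
    consonant, E copies the first following vowel."""
    if not lexical.startswith("RE"):
        return lexical
    stem = lexical[2:]
    first_consonant = next(ch for ch in stem if ch not in VOWELS)
    first_vowel = next(ch for ch in stem if ch in VOWELS)
    return first_consonant + first_vowel + stem

for word in ["REpili", "REtahi", "REkuha"]:
    print(word, ">", realize_reduplication(word))
```

The copy has a fixed length of two symbols, which is what keeps this case within reach of ordinary alternation rules.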

16
Challenge: Malay/Indonesian Full-Stem
Reduplication
  • Simple reduplication: buku+redup; stem buku
    (book)
  • buku-buku: books
  • Prefixed reduplication: bagi+meN+redup; stem
    bagi (divide)
  • membagi-bagi: divide into separate parts
  • pijit+meN+redup; stem pijit (get a
    massage)
  • memijit-mijit: squeeze
  • Redup, prefix-suffix: merah+ke+redup+an; stem
    merah (red)
  • kemerah-merahan: reddish
  • Prefix-suffix, redup: ubah+redup+per+an; stem
    ubah (difference)
  • perubahan-perubahan: alternations/changes

17
The Xerox compile-replace algorithm
  • An algorithm that takes a finite-state network as
    an argument and returns a modified (still
    finite-state) network
  • Can be applied to the upper-side and/or the
    lower-side of a network, perhaps multiple times.
  • compile-replace
  • finds delimited substrings of the form
    ^[ string ^], where the string is just a string of
    symbols, joined by concatenation, but which
    happens to have the format of a regular
    expression,
  • compiles the string as a regular expression, and
    then
  • replaces the delimited substring with the result
    of the compilation.

18
The (Xerox) finite-state iteration operator
  • ^n: n concatenations, for any integer n
  • A^2 denotes two concatenations of the language A
    with itself, equivalent to A A.
  • A = bagi
  • A^2 = bagibagi
  • Finite-state languages and relations are closed
    under n-ary concatenation.
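For finite languages, the iteration operator can be illustrated directly on sets of strings. This is a sketch with an invented function name, assuming the operator is written A^n; real networks may denote infinite languages, which a Python set cannot.

```python
from itertools import product

def concat_power(language, n):
    """Return language^n: every concatenation of n strings drawn from
    the language. For n == 0 this is the set containing only ""."""
    return {"".join(parts) for parts in product(language, repeat=n)}

print(concat_power({"bagi"}, 2))
print(sorted(concat_power({"a", "b"}, 2)))
```

A language with a single string, as in the Malay case, yields exactly one doubled string; a language with several strings yields every pairing, not only exact copies.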

19
Iteration in Morphotactics: Malay
  • define pref 0 .x.
  • define root b a g i p e r a t u r a n
  • define suff Noun0 Pl .x.
    2

  • Sg .x. 0
  • define Nouns (pref) root suff
  • The resulting intermediate FST will relate string
    pairs like the following
  • (we filter out strings with unmatched delimiters
    ^[ and ^])
  • Upper bagiNounSg
    0 0bagiNounPl
  • Lower bagi0 0
    bagi 0 2

20
compile-replace before and after
Before
  • Upper: bagi+Noun+Pl, peraturan+Noun+Pl
  • Lower: ^[bagi^2^], ^[peraturan^2^]
  • xfst compile-replace lower
After
  • Upper: bagi+Noun+Pl, peraturan+Noun+Pl
  • Lower: bagibagi, peraturanperaturan

And it applies similarly to all delimited
regular-expression substrings on the lower side.
There must be a finite number of them. Note that
this operation is performed just once at
compile-time.
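The find, compile and replace steps can be imitated on plain strings. The sketch below is a toy, not the Xerox algorithm: it works on one string rather than on the paths of a network, it assumes the delimiters are written ^[ and ^], and its "compiler" understands only the form string^n.

```python
import re

# Matches a delimited substring of the assumed form ^[ string^n ^].
DELIMITED = re.compile(r"\^\[\s*(\w+)\^(\d+)\s*\^\]")

def compile_replace(side: str) -> str:
    """Replace each delimited substring with the result of evaluating it.
    Only iteration (string^n) is supported in this toy version."""
    def evaluate(match):
        string, count = match.group(1), int(match.group(2))
        return string * count
    return DELIMITED.sub(evaluate, side)

for lower in ["^[bagi^2^]", "^[peraturan^2^]"]:
    print(lower, ">", compile_replace(lower))
```

Strings without delimiters pass through unchanged, which mirrors the point that only the delimited regular-expression substrings are touched.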
21
Another Challenge: Arabic Stem Interdigitation
  • wasayaktubuwnahaA
  • wa: and
  • sa: future marker
  • ya: imperfect prefix
  • ktb: root k t b
  • CCVC: Form I imperfect template; CCVC > ktub
    (stem)
  • u: active-voice vocalization
  • una: they, masc. plural (imperfect suffix)
  • ha: it/them (direct-object clitic pronoun
    suffix)
  • English gloss: and they will write it

Stem Interdigitation
22
Some Formal Analyses of Semitic Stems
  • Harris, 1944 (Root-Pattern):
    root n b r, pattern _a_i_, stem nabir
    root k t b, pattern _a_a_, stem katab
  • McCarthy, 1981 (Root-Template-Vocalization):
    root n b r, template CVCVC, vocalization a i, stem nabir
    root k t b, template CVCVC, vocalization a, stem katab
  • Another alternative is simply to ignore or deny
    the concept of roots and treat stems as
    monolithic morphemes.
23
Finite-State Computational Semitic
  • Kay, 1987: Arabic stem interdigitation via
    multi-level transducers (Kiraz, 2000)
  • Lavie et al., 1988: Two-Level Morphology adapted
    to Hebrew verbs
  • Kataja & Koskenniemi, 1988: Ancient
    Akkadian
  • Concatenating languages are just a special case
  • Morphotactics defined using regular
    expressions/operations
  • Roots and patterns formalized as regular
    languages
  • Roots are INTERSECTED with patterns, rather than
    concatenated, to form stems
  • Sublexicon of Roots: ?* k ?* t ?* b ?*
  • Sublexicon of Patterns: CaCaC
  • Pre-intersected by awk scripts: katab
  • Then compiled by TwoL
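The idea of intersecting a root language with a pattern language can be tried out on a small scale. In the sketch below the root language is assumed to be "the root consonants in order, with anything in between" and is expressed as a Python regular expression; the pattern language is enumerated over a small, assumed consonant inventory. The function names are invented, and this brute-force enumeration is not how the awk scripts or a finite-state compiler would do it.

```python
import re
from itertools import product

CONSONANTS = "ktbdrs"  # assumed inventory, for illustration only

def pattern_language(pattern):
    """Enumerate the strings of a pattern such as CaCaC, where each C
    stands for any consonant and other symbols stand for themselves."""
    slots = [CONSONANTS if ch == "C" else ch for ch in pattern]
    return {"".join(chars) for chars in product(*slots)}

def root_regex(root):
    """Root language: the root consonants in order, anything between."""
    return re.compile(".*" + ".*".join(re.escape(c) for c in root) + ".*")

def intersect(root, pattern):
    """Strings that belong to both the root and the pattern language."""
    rx = root_regex(root)
    return sorted(s for s in pattern_language(pattern) if rx.fullmatch(s))

print(intersect("ktb", "CaCaC"))
print(intersect("drs", "CaCaC"))
```

Because the pattern fixes the vowels and the root fixes the consonants, the intersection contains a single stem for each root and pattern pair.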

24
Beesley: Arabic Stem Intersection at Runtime
  • ALPNET (88-90): root k t b, word pattern
    wasayaCCuCunaha
  • Roots and patterns resided in separate
    sublexicons
  • Root and pattern sublexicons were traversed in
    parallel at runtime
  • Intersection was simulated in C code
    (detouring) at runtime
  • ktb and CCuC were returned as separate morphemes
    in the analyses
  • Still mostly a Two-Level System

Beesley: Stem Intersection at Compile-time
  • Xerox (1996-98): Reimplementation using
    Xerox Finite-State Morphology
  • On-line demo available:
    http://www.xrce.xerox.com/research/mltt/arabic
  • Use any Java-enabled browser
25
Xerox Arabic Morphological Analyzer
  • About 4930 roots in the underlying dictionary
  • Each root is encoded to show which patterns it
    can combine with
  • Roots and patterns are intersected to form over
    90,000 stems
  • With various combinations of prefixes and
    suffixes, the system encodes 72,000,000
    fully-voweled words, with their morphological
    analyses
  • In addition, it analyzes unvoweled and partially
    voweled spellings
  • The compiled analyzer network is currently
    storable in about 5 MB
  • The web demo is Unicode based and renders Arabic
    script as you type
  • Roots, patterns and other affixes are separated
    and returned

26
Intersecting Stems on One Side of a Transducer at
Compile Time
  • Start with a Two-Level Lexicon
  • Compose FS Intersecting Rules at Compile Time
  • Upper: wasayaktb CCuCunaha
  • Lower: wasayaktb CCuCunaha
  • .o.
  • Finite-State Stem-Intersection Rules
  • Result:
  • Upper: wasayaktb CCuCunaha
  • Lower: wasaya ktub unaha
  • Then apply the finite-state morphophonological
    alternation/realization rules, handling weak
    roots, hamza orthography in general,
    assimilation, deletion, ...

27
Finite-State Merge: fast special-case intersection
  • .m>. is the merge-to-the-right operator and
  • .<m. is the merge-to-the-left operator
  • ktb .m>. CVVCVC .<m. a > kaatab
  • ktb .m>. CVVCVC .<m. ui > kuutib
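A string-level imitation of the two merge steps is sketched below. Several things here are assumptions rather than facts from the slides: the function names, the vowel inventory, and above all the vocalizations, which are written as the Python regular expressions a+ and u*i so that they can spread over three V slots (the transcript shows only a and ui). The real merge operators work on networks, not on single strings.

```python
import re
from itertools import product

VOWELS = "aiu"  # assumed vowel inventory

def merge_right(root, template):
    """Fill each C slot of the template with the next root consonant."""
    if template.count("C") != len(root):
        raise ValueError("root does not fit the template's C slots")
    consonants = iter(root)
    return "".join(next(consonants) if s == "C" else s for s in template)

def merge_left(template, vocalization):
    """Fill the V slots with the one vowel string of the right length
    that matches the vocalization (given as a Python regex)."""
    n = template.count("V")
    fits = ["".join(v) for v in product(VOWELS, repeat=n)
            if re.fullmatch(vocalization, "".join(v))]
    if len(fits) != 1:
        raise ValueError("vocalization does not fit uniquely")
    vowels = iter(fits[0])
    return "".join(next(vowels) if s == "V" else s for s in template)

print(merge_left(merge_right("ktb", "CVVCVC"), "a+"))
print(merge_left(merge_right("ktb", "CVVCVC"), "u*i"))
```

Treating the vocalization as a small language and picking the member that fits the template is one way to model vowel spreading; it is chosen here for brevity, not because the source specifies it.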

28
The compile-replace algorithm before and after
Before
  • Upper: ktb.m>.CVVCVC.<m.uia
  • Lower: ktb.m>.CVVCVC.<m.uia
  • xfst list C k t b d r s m n
  • xfst list V a i u
  • xfst compile-replace lower
After
  • Upper: ktb.m>.CVVCVC.<m.uia
  • Lower: kuutib a

and similarly for about 90,000 stems
29
The compile-replace algorithm
  • A general compile-time technique that allows the
    regular-expression compiler to apply to and
    modify its own output.
  • Somewhat similar in operation to eval in LISP
    and Perl.
  • Appears to handle some classic examples of
    non-concatenative morphotactics: full-stem
    reduplication and Semitic stem interdigitation,
    under either
  • Two-way root-pattern theory, or
  • Three-way root-template-vocalization theory
  • We've only begun to explore the possibilities.

30
What is Finite-State Computing Good For?
  • Mostly lower-level natural language processing
  • Tokenization
  • Spelling checking/correction
  • Phonology
  • Morphological Analysis/Generation
  • Part-of-Speech Tagging
  • Shallow Syntactic Parsing and Chunking

Finite-state techniques cannot do everything, but
for tasks where they do apply, they are extremely
attractive.
31
What about Maltese?
  • Necessary preliminary work has already started
  • Corpora
  • Lexicography
  • Formal linguistic description
  • Finite-state implementation
  • Xerox finite-state calculus already licensed at
    Univ. of Malta
  • The compile-replace algorithm will soon be
    released
  • The Book (Beesley and Karttunen, forthcoming)
  • Unique opportunity
  • Semitic component
  • Routinely written, in a culture with high literacy

32
Final Observations
  • Successful computational linguistic projects are
    often the result of cooperation between a
    computational linguist and a more traditional
    descriptive linguist
  • Computational linguistics can be commercially
    rewarding
  • Computational linguistics is a healthy discipline
    from the descriptive point of view
  • Your grammars can literally be tested on millions
    of words
  • Any mistakes or gaps in your grammars soon become
    apparent