Regular Sound Changes for Cross-Language Information Retrieval

1
Regular Sound Changes for Cross-Language
Information Retrieval
  • Michael P. Oakes
  • University of Sunderland

2
Cognates
  • Cognates are vocabulary items which
  • occur in two or more historically-related
    languages,
  • have similar orthography,
  • and have similar meanings.
  • abeto / abeto
  • abierto / aberto

3
Rationale
  • Can cognates be transformed into each other by a
    predictable series of phonological changes?
  • Cross-Language Information Retrieval
  • The automatic conversion of query terms in one
    language into their equivalents in the other will
    enable documents in that second language to be
    retrieved by a search engine.

4
Case Study
  • Given a sample word list from two related
    languages, program JAKARTA extracts the probable
    rules for predicting any word of one language
    from that of the other.
  • We will examine a vocabulary list for Spanish and
    Galician.
  • AULEX diccionario Español-Gallego en línea.

5
Damerau-Levenshtein metric
  • Damerau (1964) found that 80% of spelling errors
    in a sample of human-keypunched texts were
    single-error misspellings, i.e. a single one of the
    following:
  • insertion: mistyping "the" as "ther"
  • deletion: mistyping "the" as "th"
  • substitution: mistyping "the" as "thw"
  • transposition: mistyping "the" as "hte"
  • This suggests the minimum edit method of spelling
    error correction. The minimum edit distance is the
    least number of insertions, deletions and
    substitutions required to transform one word into
    another.
  • Exercise: Given a dictionary consisting of scarf,
    scare, scene and scent, what is the most likely
    correct spelling of sene?
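
The exercise can be checked with a short sketch of the minimum edit distance. This is a plain unit-cost version (insertion, deletion and substitution each cost 1, no transposition); the function name edit_distance is chosen here for illustration and is not taken from the slides.

```python
def edit_distance(a, b):
    """Least number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))        # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                         # deleting the first i characters of a
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,              # delete ca
                cur[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb)  # substitute, or free match
            ))
        prev = cur
    return prev[-1]

dictionary = ["scarf", "scare", "scene", "scent"]
best = min(dictionary, key=lambda word: edit_distance("sene", word))
print(best)  # scene (one insertion of "c")
```

Under these unit costs, scene is one edit away from sene and scent is two, so scene is the most likely correction.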

6
  • To find regular sound changes between cognates,
    we need alignment as well as a cost function.

7
Dynamic Programming: Insertions, Deletions and Substitutions
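
A minimal sketch of how dynamic programming can give an alignment as well as a cost. It fills the usual unit-cost table and then traces back from the final cell. The function name align and the tie-breaking order (diagonal first, then deletion, then insertion) are assumptions made for this illustration; the slides do not say how JAKARTA resolves ties.

```python
def align(a, b):
    """Return (cost, aligned pairs); "" marks an inserted or deleted character."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1,                           # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution/match
    pairs, i, j = [], n, m
    while i > 0 or j > 0:                                            # trace back
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((a[i - 1], "")); i -= 1
        else:
            pairs.append(("", b[j - 1])); j -= 1
    pairs.reverse()
    return d[n][m], pairs

cost, pairs = align("bruja", "bruxa")
changes = [p for p in pairs if p[0] != p[1]]
print(cost, changes)  # 1 [('j', 'x')]
```

The non-matching pairs in the alignment are the candidate sound changes.
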
8
Substitution
  • The most common single-character substitution
    (60) between Spanish and Galician was found to be
    j → x, e.g.
  • abadejo / abadexo
  • almeija / ameixa
  • anaranjado / anaranxado
  • berenjena / berenxena
  • bruja / bruxa
  • Next most common: u → o (40, similar vowels), h →
    f (27, fortition), i → e (19), g → x (15).
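
A simplified sketch of how such substitution counts could be tallied. It only compares equal-length pairs position by position, which is an assumption made to keep the example short; pairs of different length (such as almeija / ameixa) need a proper alignment first and are skipped here.

```python
from collections import Counter

# The example pairs listed on this slide (Spanish, Galician).
pairs = [
    ("abadejo", "abadexo"),
    ("almeija", "ameixa"),
    ("anaranjado", "anaranxado"),
    ("berenjena", "berenxena"),
    ("bruja", "bruxa"),
]

def tally_substitutions(word_pairs):
    """Count single-character substitutions in equal-length word pairs."""
    counts = Counter()
    for spanish, galician in word_pairs:
        if len(spanish) != len(galician):
            continue  # would need alignment; outside this sketch
        for x, y in zip(spanish, galician):
            if x != y:
                counts[(x, y)] += 1
    return counts

counts = tally_substitutions(pairs)
print(counts.most_common(1))  # [(('j', 'x'), 4)]
```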

9
Fortition / Lenition
  • h < f < p < b > w < v
  • l < d
  • r < s
  • q < k

10
Insertion / Deletion
  • Called apocope when it occurs at the end of a
    word, e.g. null → e (35)
  • abad / abade
  • arbol / arbore
  • azucar / azucre
  • canal / canle
  • ciudad / cidade

11
Sound changes involving more than one character
  • JAKARTA looks for other types of sound change
    which commonly occur between related languages
    throughout the world, as listed by Terry Crowley
    (1996)
  • 1:2, 2:1, 2:2, 2:3, 3:2 (number of characters in
    word 1 : number of characters in word 2)
  • These are given lower costs than non-typical
    changes.

12
Vowel Breaking 1:2
  • A single vowel in word 1 becomes two adjacent
    vowels in word 2. The vowel in word 1 must be the
    same as one of the vowels in word 2.
  • e.g. a → ai (7)
  • abajo / abaixo
  • caja / caixa
  • debajo / debaixo
  • jamas / jamais
  • mas / mais

13
Vowel Breaking 2:1
  • Two adjacent vowels in word 1 become just one in
    word 2; the vowel in word 2 must be the same as
    one of the vowels in word 1.
  • e.g. ua → a (9)
  • agua / auga
  • cuaderno / caderno
  • cuadrado / cadrado
  • cuadro / cadro
  • cuando / cando
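
The two vowel-breaking conditions can be written as simple predicates. This is a sketch only: the function names and the five-vowel set are assumptions for illustration (accented vowels are ignored, as in the unaccented examples on the slides).

```python
VOWELS = set("aeiou")  # assumed vowel inventory for this sketch

def is_breaking_1_to_2(single, pair):
    """One vowel becomes two adjacent vowels, one of which is the original."""
    return (len(single) == 1 and len(pair) == 2
            and single in VOWELS
            and all(ch in VOWELS for ch in pair)
            and single in pair)

def is_breaking_2_to_1(pair, single):
    """Two adjacent vowels become one vowel that was already in the pair."""
    return is_breaking_1_to_2(single, pair)

print(is_breaking_1_to_2("a", "ai"))  # True  (mas / mais)
print(is_breaking_2_to_1("ua", "a"))  # True  (cuando / cando)
print(is_breaking_1_to_2("e", "ai"))  # False (vowel not preserved)
```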

14
Unpacking 1:2
  • Character 1 in word 1 unpacks to become adjacent
    characters 2a and 2b in word 2, if the number of
    phonetic features common to both (1 and 2a) and
    (1 and 2b) is greater than a predetermined
    threshold.
  • Phonetic features are: vocalisation, place, manner
  • e.g. e → ei (43)
  • acero / aceiro
  • aguacatero / aguacateira
  • babero / babeiro
  • bandera / bandeira
  • barbero / barbeiro

15
2:1 Fusion
  • Characters 1a and 1b (adjacent characters in word
    1) fuse to become the single character 2 in the
    word of the other language if the number of
    phonetic features common to both (1a and 2) and
    (1b and 2) is greater than a predetermined
    threshold.
  • e.g. mb → m (4)
  • hombre / home (3:1?)
  • lumbre / lume
  • nombre / nome
  • tambien / tamen
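
A minimal sketch of the shared-feature test for fusion. The feature table and the threshold value of 1 are assumptions made for this example (the slides give neither); each character is described by the three features named earlier, in the order vocalisation, place, manner.

```python
# Hypothetical feature table: (vocalisation, place, manner).
FEATURES = {
    "m": ("voiced", "bilabial", "nasal"),
    "b": ("voiced", "bilabial", "stop"),
    "s": ("voiceless", "alveolar", "fricative"),
}
THRESHOLD = 1  # assumed value; the slides only say "predetermined"

def shared_features(x, y):
    """Number of phonetic features on which two characters agree."""
    return sum(fx == fy for fx, fy in zip(FEATURES[x], FEATURES[y]))

def can_fuse(first, second, fused, threshold=THRESHOLD):
    """True if both source characters share enough features with the result."""
    return (shared_features(first, fused) > threshold
            and shared_features(second, fused) > threshold)

print(can_fuse("m", "b", "m"))  # True: mb -> m, as in lumbre / lume
print(can_fuse("m", "b", "s"))  # False: too few shared features
```

Unpacking is the mirror image: the same test applied with the single character in word 1 and the pair in word 2.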

16
2:2 Assimilation
  • The two characters in word 2 have more phonetic
    features in common than those in word 1.
  • e.g. ja → ll (9)
  • abeja / abella (2:1 or 3:2?)
  • aguja / agulla
  • burbuja / burbulla
  • ceja / cella
  • hija / filla

17
Disregard if not Cognate
  • alubia / feixon
  • a → 0: apocope (1)
  • i → n: substitution (2)
  • b → o: substitution (2)
  • u → x: substitution (2)
  • al → ei: assimilation (1)
  • 0 → f: prosthesis (1)
  • Cost: 9

18
Collate if Cognate
  • amarillo / amarelo
  • o → o: match
  • ll → l: fusion (1)
  • i → e: vowel_similar (1)
  • r → r: match
  • a → a: match
  • m → m: match
  • a → a: match
  • Cost: 2
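
The two worked examples can be reproduced by summing the per-operation costs listed on the slides. The cut-off MAX_COGNATE_COST below is a hypothetical value chosen only to separate these two pairs; the slides do not state the threshold JAKARTA uses.

```python
# (change, label, cost) triples copied from the two slides.
alubia_feixon = [
    ("a -> 0", "apocope", 1),
    ("i -> n", "substitution", 2),
    ("b -> o", "substitution", 2),
    ("u -> x", "substitution", 2),
    ("al -> ei", "assimilation", 1),
    ("0 -> f", "prosthesis", 1),
]
amarillo_amarelo = [
    ("o -> o", "match", 0),
    ("ll -> l", "fusion", 1),
    ("i -> e", "vowel_similar", 1),
    ("r -> r", "match", 0),
    ("a -> a", "match", 0),
    ("m -> m", "match", 0),
    ("a -> a", "match", 0),
]

MAX_COGNATE_COST = 4  # hypothetical threshold, not from the slides

def total_cost(operations):
    """Sum the costs of all operations in an alignment."""
    return sum(cost for _, _, cost in operations)

def is_cognate(operations, limit=MAX_COGNATE_COST):
    return total_cost(operations) <= limit

print(total_cost(alubia_feixon), is_cognate(alubia_feixon))        # 9 False
print(total_cost(amarillo_amarelo), is_cognate(amarillo_amarelo))  # 2 True
```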

19
Summary
  • To translate a Spanish query term into a Galician
    query term for CLIR, find the most similar term
    (or all terms with above threshold similarity) in
    the Galician lexicon
  • Most similar means least edit cost
  • Cost associated with insertion, deletion and
    substitution
  • No cost associated with exact character match
  • No cost associated with j → x (60), e → ei (43),
    ie → e (42), u → o (41), g → x (35), 0 → e
    (final)(35), h → f (27), e → 0 (medial)(27), ll →
    l (23), l → 0 (medial)(21)
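
A rough sketch of the lookup described in this summary: an edit distance in which a few of the listed regular changes are free, used to pick the closest term from a lexicon. The rule subset, the four-word lexicon and the function names are assumptions for illustration; the position-dependent rules (final or medial insertion and deletion) are left out to keep the sketch short, and this is not claimed to be JAKARTA's implementation.

```python
# Zero-cost rewrite rules (Spanish substring -> Galician substring).
FREE_RULES = [("j", "x"), ("e", "ei"), ("ie", "e"), ("u", "o"),
              ("g", "x"), ("h", "f"), ("ll", "l")]

def rule_cost(a, b, rules=FREE_RULES):
    """Edit cost from a to b where exact matches and listed rules cost 0."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = d[i][j]
            if i < n:
                d[i + 1][j] = min(d[i + 1][j], cur + 1)          # deletion
            if j < m:
                d[i][j + 1] = min(d[i][j + 1], cur + 1)          # insertion
            if i < n and j < m:
                step = 0 if a[i] == b[j] else 1                  # match or substitution
                d[i + 1][j + 1] = min(d[i + 1][j + 1], cur + step)
            for src, tgt in rules:                               # free regular changes
                if a.startswith(src, i) and b.startswith(tgt, j):
                    ni, nj = i + len(src), j + len(tgt)
                    d[ni][nj] = min(d[ni][nj], cur)
    return d[n][m]

def translate(term, lexicon):
    """Pick the lexicon entry with the least edit cost from term."""
    return min(lexicon, key=lambda word: rule_cost(term, word))

galician = ["bruxa", "bandeira", "amarelo", "feixon"]  # toy lexicon
print(translate("bruja", galician))     # bruxa
print(translate("amarillo", galician))  # amarelo
```

Returning all terms whose cost falls under a threshold, rather than only the minimum, would be a small change to translate.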