Title: With 6,500 languages in the world, we must explore new ways to learn, document, and share our linguistic knowledge.
1. With 6,500 languages in the world, we must explore new ways to learn, document, and share our linguistic knowledge.
- John J. Kovarik
- NSA/CSS Senior Language Technology Authority
2. Unlocking and Sharing LCTL Linguistic Knowledge
Keywords: CFG parsing, language generation, computational linguistics
- CALICO 05
- University of Michigan
- Ann Arbor, MI May 17-20, 2005
3. The Challenges of Learning and Sharing Knowledge of an LCTL in the 21st Century
- John J. Kovarik
- National Security Agency
4. Presentation Overview
- General LCTL Challenges
- Challenges of Learning Mongolian
- Recipe for New Approach
- Khalkha Mongolian Parts of Speech
- Mongolian Morphological Affixes
- Method of Lexical Knowledge Representation
- Analyze, Parse, Build Grammar Model, Test
- Iterate Repeatedly
5. LCTL Learning Challenges
- Fewer Learned Resources to Learn from
- Less Recognition Nationally
- Fewer Opportunities to Document What's Learned
- Very Few Students to Learn from You
- Almost All Learning Done Manually
- Few Reliable 21st Century Applications
- Microsoft IME
- Font
6. Mongolian Learning Challenges
- Input Method Editor (IME)
- Microsoft IME
- Keyboard arranged for native Mongols
- American Mongolists prefer phonetic keyboard
- a key on Mongolian keyboard mapped to ASCII a, etc.
- Fonts commonly used on Internet
- Russian Cyrillic fonts are commonly used
- and 0 commonly substituted for ? and ?
- ? and ? often freely extended to ? and ?
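The phonetic-keyboard idea on this slide (each Cyrillic letter reached through the ASCII key that sounds like it) can be sketched as a lookup table. This is an illustration only, written in Python: the names `PHONETIC_KEYS` and `to_cyrillic` are hypothetical, the letter pairs are a small assumed subset, and nothing here reproduces the Microsoft IME layout.

```python
# Assumed partial phonetic layout: ASCII key -> Mongolian Cyrillic letter.
PHONETIC_KEYS = {
    "a": "а", "b": "б", "g": "г", "d": "д", "h": "х",
    "i": "и", "j": "ж", "l": "л", "o": "о", "r": "р", "u": "у",
}

def to_cyrillic(typed: str) -> str:
    """Replace each mapped ASCII key with its Cyrillic letter; keep the rest."""
    return "".join(PHONETIC_KEYS.get(ch, ch) for ch in typed.lower())

print(to_cyrillic("BAGA HURAL"))
```

Characters without an entry in the table pass through unchanged, so a partial layout still produces readable output.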
7. Recipe for a New Approach
- Take a student with a computational linguistics background
- Infuse with curiosity and energy
- Stir in access to the Internet
- Add Mongolian syntax and morphology
- Create morphological analyzer, context-free parser, and grammatical generator for Mongolian
- Resulting lexicons, software, and grammar models can be used by other linguistically adept students
8. Khalkha Mongolian Parts of Speech
- Declinable Nouns
- Declinable Adjectives
- Inflected Verbs
- Unchanging Adverbs
- Declinable Converbs
- Unchanging Postpositions
- Unchanging Conjunctions
- Unchanging Particles
9. Mongol Morphological Affixes
- 27 verbal suffixes denoting tense and mood
- 2 verb infixes denoting verb manner
- Consultative
- Passive
- 6 verb paradigms or verb types
- 3 irregular common verbs
- 6 cases in singular and plural number
- Both nouns and adjectives are declined
10. Lexical Knowledge Representations
- Unchanging adverbs, conjunctions, particles, etc. and irregular verb forms (unchanging.txt file)
- Lemmas of declinable nouns and adjectives (declinables.txt file)
- Inflected verbs and nominalized verbs (regvb.txt file)
- Affix files (casendings.txt, reflex.txt, infixes.txt, vbforms.txt)
11. Some Examples
- declinables.txt file
- N ??? Q ???
- regverb.txt file
- V ?? V ??
- Affix files
- casendings.txt g ??? d ? a ?? b ???
- reflex.txt ?? ?? ??
- infixes.txt C ?? R ?? P ??
- vbforms.txt ipf ?? i1p ? i3p ??? Ypf ?????
- unchanging.txt file
- Pg->?????? Pc->????????????
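The declinables and verb entries on this slide look like a tag followed by a word form. Assuming that layout (the slide does not spell out the file format), a lexicon file could be read into a lookup table roughly as follows. The function name `load_pairs` is hypothetical, and the sample entries use the Latin transliteration that appears on later slides; they are illustrative, not the contents of declinables.txt.

```python
def load_pairs(text: str) -> dict[str, str]:
    """Read alternating 'TAG lemma' tokens into a lemma -> tag table."""
    tokens = text.split()
    if len(tokens) % 2:
        raise ValueError("expected an even number of tokens (tag, lemma pairs)")
    return {lemma: tag for tag, lemma in zip(tokens[0::2], tokens[1::2])}

# Illustrative entries only.
declinables = load_pairs("N HURAL N AJIL")
print(declinables)
```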
12. Merge Morphology Knowledge with the Power of the Computer
- Wrote yalgah.pl to become tireless lexical pedagogue
- Searches for identifiable affixes by comparison with lexical knowledge affix files
- Matches resulting lemma against lexical knowledge declinables, verbs, and unchanging words, then outputs word/part of speech tag to standard output file plus expository lexicon
- Depending on whether the lemma can or cannot be matched, outputs:
- Lemma to Out Of Vocabulary (oov) file noting affixes found
- Word/part of speech tag to standard output file
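The affix-stripping and lemma-matching steps described on this slide can be sketched as follows. This is a Python illustration under stated assumptions, not the yalgah.pl script: the toy lexicon, the tag spellings, and the names `analyze` and `tag_text` are hypothetical, and the sketch does not record which affixes were found for out-of-vocabulary words.

```python
# Toy lexical knowledge (assumed contents, far smaller than the real files).
DECLINABLES = {"HURAL": "N", "AJIL": "N", "BAGA": "J", "IH": "J"}
VERB_STEMS = {"OR"}
CASE_ENDINGS = {"D": "d", "": "n"}   # dative-locative, zero-marked nominative
REFLEXIVE_ENDINGS = ["AA", ""]       # try the overt ending first
VERB_FORMS = {"LOO": "i2p"}          # indicative second past

def analyze(word):
    """Return a part-of-speech tag for word, or None if no lemma matches."""
    for reflexive in REFLEXIVE_ENDINGS:
        if not word.endswith(reflexive):
            continue
        rest = word[: len(word) - len(reflexive)]
        for ending, case in CASE_ENDINGS.items():
            if not rest.endswith(ending):
                continue
            lemma = rest[: len(rest) - len(ending)]
            pos = DECLINABLES.get(lemma)
            if pos == "N":
                return "N" + case
            if pos is not None:
                return pos           # adjectives keep a bare tag in this sketch
    for suffix, form in VERB_FORMS.items():
        if word.endswith(suffix) and word[: -len(suffix)] in VERB_STEMS:
            return "V" + form
    return None

def tag_text(words):
    """Split words into word/tag strings and an out-of-vocabulary list."""
    tagged, oov = [], []
    for word in words:
        tag = analyze(word)
        if tag is None:
            oov.append(word)
        else:
            tagged.append(f"{word}/{tag}")
    return tagged, oov

tagged, oov = tag_text("BAGA HURAL IH AJILDAA ORLOO IREG".split())
print(tagged)
print(oov)
```

IREG lands in the out-of-vocabulary list only because the toy lexicon has no entry for it; adding the stem and its verb form would move it to the tagged output.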
13. Additional Outputs
- Expository Morphology File (named morphlex.txt)
- IR->verb command imperative 2nd person singular
- IREEREY->converb future perfect continuative
- IREG->verb command concessive 3rd person singular/plural
- BAGA->adjective
- HURAL->noun nominative
- IH->adjective
- AJILDAA->reflexive noun dative-locative
- ORLOO->verb indicative second past
- Out Of Vocabulary File (named oov)
- C > 5 0 E 0 0 A 0 0 (UNKNOWNAHAASAA) WORD 0 LINE 2 - FALLS OUTSIDE OF VOCABULARY
- possible reflexive ending <0 0 >-<AA>
- possible declinable case ending <b>-<0 0 A >-<AAS>
- possible verbal part of speech <Ypf >-<0 E >-<AH>
- possible participial/converbal stem <C > 5 >--<UNKNOWN>
14. Feed Analytic Output to Parser
- Developed context-free grammar (CFG) rules for both discourse and newspaper texts
- S->Sbj Prd, S->Prd, Sbj->Nn, Sbj->NP
- NP->Tg Nn, NP->Tg Ng Nn, Prd->J
- Wrote parse.pl to validate CFG rules against input text tagged for part of speech
- When each sentence can be fully parsed, outputs a parse tree and an English gloss.
- Working on "BAGA HURAL IH AJILDAA ORLOO ."
- ENGLISH GLOSS large hural great work began .
- The sentence does parse.
- Branch nodes on tree
- S -> (Sbj Prd)
- Sbj -> (NP)
- NP -> (J Nn)
- Prd -> (NPd Vi2p)
- NPd -> (J Nd)
- POS J Nn J Nd Vi2p
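Validating CFG rules against a tagged sentence can be sketched with a small backtracking recognizer. This is a Python illustration, not the parse.pl script: the rule table combines the rules and tree branches listed on this slide, and the names `GRAMMAR`, `parse`, `expand`, and `parse_sentence` are hypothetical. The approach assumes the grammar has no left-recursive rules.

```python
# Rules taken from the slide; any symbol without an entry is a POS tag.
GRAMMAR = {
    "S": [["Sbj", "Prd"], ["Prd"]],
    "Sbj": [["Nn"], ["NP"]],
    "NP": [["Tg", "Nn"], ["Tg", "Ng", "Nn"], ["J", "Nn"]],
    "Prd": [["J"], ["NPd", "Vi2p"]],
    "NPd": [["J", "Nd"]],
}

def parse(symbol, tags, pos=0):
    """Yield (tree, next_pos) for every way symbol can cover tags[pos:]."""
    if symbol not in GRAMMAR:                 # terminal: must match the tag
        if pos < len(tags) and tags[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(symbol, rhs, tags, pos)

def expand(symbol, rhs, tags, pos, children=()):
    """Match the right-hand side left to right, backtracking on failure."""
    if not rhs:
        yield (symbol, list(children)), pos
        return
    for child, nxt in parse(rhs[0], tags, pos):
        yield from expand(symbol, rhs[1:], tags, nxt, children + (child,))

def parse_sentence(tags):
    """Return the first tree that covers the whole tag sequence, else None."""
    for tree, end in parse("S", tags):
        if end == len(tags):
            return tree
    return None

tree = parse_sentence("J Nn J Nd Vi2p".split())
print(tree)
```

A return value of `None` signals that the current rules do not cover the sentence, which is the point at which a new rule would be written.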
15. Feed Output to Generator
- Wrote gramgen.pl to generate sentences based on lexical knowledge, morphological knowledge, and syntactic knowledge gained
- Output routinely reviewed for accuracy and Chomskian explanatory adequacy of the grammar models created for the parser and generator engines
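Generating sentences from grammar rules and a lexicon can be sketched as random top-down expansion. This is a Python illustration, not the gramgen.pl script: the grammar is an assumed subset of the parser rules, the lexicon holds only word forms that appear on earlier slides, and the names `GRAMMAR`, `LEXICON`, and `generate` are hypothetical.

```python
import random

# Assumed subset of the parser's rules (no recursion, so expansion ends).
GRAMMAR = {
    "S": [["Sbj", "Prd"]],
    "Sbj": [["Nn"], ["NP"]],
    "NP": [["J", "Nn"]],
    "Prd": [["NPd", "Vi2p"]],
    "NPd": [["J", "Nd"]],
}
# Inflected forms keyed by part-of-speech tag (illustrative entries).
LEXICON = {
    "J": ["BAGA", "IH"],
    "Nn": ["HURAL"],
    "Nd": ["AJILDAA"],
    "Vi2p": ["ORLOO"],
}

def generate(symbol, rng):
    """Expand symbol top-down, choosing rules and word forms at random."""
    if symbol in LEXICON:
        return [rng.choice(LEXICON[symbol])]
    words = []
    for child in rng.choice(GRAMMAR[symbol]):
        words.extend(generate(child, rng))
    return words

rng = random.Random(0)        # fixed seed so a review run can be repeated
for _ in range(5):
    print(" ".join(generate("S", rng)))
```

Every output is grammatical by construction but may be semantically odd, which is why generated sentences still need a human review.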
16. Iterative Process
- First take new newspaper article or dialogue and run morphological analyzer on it until all words are listed within vocabulary (no output in the oov Out Of Vocabulary file)
- Run output through parser, creating new CFG rules until new text parses
- Run generator for a hundred or more examples to ensure adequacy of new rules
17. Morpho-analyzer, Parser, Generator Software Led This Student to Deeper Understanding of Mongolian
- A linguistically adept learner can thus write software to help one learn more deeply and faster
- Language tool development is thus grounded in gaining and applying language knowledge in a systematic and linguistically principled manner for oneself and others
18. Contact Information
- John Kovarik
- Email: kovarik_at_afterlife.ncsc.mil
- Home Page: http://www.worldnet.att/kovariks
- Phone: 443-479-7188