Title: K.U. Leuven Leuven 20080508
1 Morphological Normalization and Collocation Extraction
- Jan Šnajder, Bojana Dalbelo Bašić, Marko Tadić
- University of Zagreb, Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences
- jan.snajder_at_fer.hr, bojana.dalbelo_at_fer.hr, marko.tadic_at_ffzg.hr
- Seminar at the K. U. Leuven, Department of Computing Science, Leuven, 2008-05-08
2 Morphological Normalization
- Jan Šnajder, Marko Tadić
3 Talk overview
- who we are?
- what are we doing?
- morphological processing: normalization
- lemmatization vs. stemming
- Mollex: a system for normalization of Croatian
- usage in document indexing and text classification
- collocations as features
- collocation extraction by co-occurrence measures
- usage of genetic programming
4 Who we are?
- University of Zagreb, Croatia
- founded 1669, 52,500 undergraduate students
- two faculties on the same mission
- build the systems that will develop and enable the usage of language resources and tools for Croatian
5 Who we are 2?
- Faculty of Humanities and Social Sciences
- Institute / Department of Linguistics
- dealing with basic computational linguistic tasks for Croatian
- compiling and processing large-scale language resources
- Croatian National Corpus, Croatian Morphological Lexicon, Croatian WordNet, Croatian Dependency Treebank
- tagger, lemmatizer
- chunker, parser
- NERC system
6 Who we are 3?
- Faculty of Electrical Engineering and Computing
- Department of Electronics, Microelectronics, Computer and Intelligent Systems / KTLab
- Knowledge Technologies Laboratory; the group deals with
- text preprocessing techniques for Croatian for machine learning procedures
- dimensionality reduction and document clustering in the vector space model, visualisation
- automatic indexing of documents
- intelligent, language-specific information retrieval and extraction
7 What are we doing?
- working jointly on several research projects
- AIDE: Automatic Indexing with Descriptors from Eurovoc (cooperation with the Government of the Republic of Croatia, HIDRA)
- RMJT: Computational Linguistic Models and Language Technologies for Croatian (national research programme, two of five projects)
- Croatian language resources and their annotation, 2007-2011, prof. Marko Tadić
- Knowledge discovery in textual data, 2007-2011, prof. Bojana Dalbelo Bašić
- CADIAL: Computer Aided Document Indexing for Accessing Legislation
- joint Flemish-Croatian project
- 2007-2009
- prof. Marie-Francine Moens, prof. Bojana Dalbelo Bašić
8 Morphological processing
- computational linguistic / NLP task
- important for inflectionally rich languages, e.g.
- Croatian noun in 14 word-forms (7 cases, 2 numbers)
- N student studenti
- G studenta studenata
- D studentu studentima
- A studenta studente
- V studentu studenti
- L studentu studentima
- I studentom studentima
- unlike an English noun, with 2 (3?) word-forms (2 numbers + possessive?)
- Sg student, Poss (student's)
- Pl students
- present in all Slavic languages (excl. Bulgarian), German, Greek, Baltic languages, Finnish, ...
9 Morphological processing 2
- three basic subtasks in inflection processing
- generation of (all) word-forms (WFs) of a lexeme
- analysis of WFs, i.e. recognizing the values of morphosyntactic categories of a WF in text
- recognizing which lexeme(s) a WF belongs to
- the last one helps us avoid the problem of data sparseness in many text processing tasks, e.g.
- information retrieval, text mining, document indexing
- normalization: conflating the morphological variants of a word to a single representative form
- two main ways to do that
- linguistically motivated: lemmatization
- computationally motivated: stemming
10 Morphological processing 3
- lemmatization
- replacing the WF with its proper base WF, usually called the lemma
- e.g. mapping the theoretical maximum of (e.g. 14) WFs to 1 lemma
- lexicon based
- large lexicons of all (generated) WFs needed
- preparation expensive in time and manpower
- mostly realized by databases
- algorithmic based
- mostly FSTs: compact, efficient, fast
- lexicon of lemmas and their inflectional patterns needed anyway
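A minimal sketch of the lexicon-based variant, assuming the lexicon is a plain in-memory dictionary rather than a database. The names `build_lexicon` and `lemmatize` are illustrative only and not part of the system described in the talk. The word-forms are those of the noun "student" from the paradigm shown earlier.

```python
# Hypothetical lexicon-based lemmatizer: every word-form is stored explicitly.
# The 14 paradigm slots of "student" collapse to 8 distinct strings.
STUDENT_FORMS = [
    "student", "studenta", "studentu", "studentom",      # singular
    "studenti", "studenata", "studentima", "studente",   # plural
]

def build_lexicon(paradigms):
    """Map each word-form to the set of lemmas it can belong to."""
    lexicon = {}
    for lemma, forms in paradigms.items():
        for form in forms:
            lexicon.setdefault(form, set()).add(lemma)
    return lexicon

def lemmatize(lexicon, word_form):
    """Return the candidate lemmas; unknown forms are returned unchanged."""
    return lexicon.get(word_form.lower(), {word_form})

LEXICON = build_lexicon({"student": STUDENT_FORMS})
print(lemmatize(LEXICON, "studentima"))  # {'student'}
print(lemmatize(LEXICON, "Leuven"))      # {'Leuven'} (not covered by the lexicon)
```

Returning a set rather than a single lemma leaves room for homographic forms that belong to more than one lexeme.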
11 Morphological processing 4
- stemming
- reducing the WF from the end by truncating the possible endings
- does not have to respect the linguistic boundaries
- vukØ > vukØ
- vuka > vuka
- vuče > vuče
- reducing all the WFs to a common beginning
- problems where there are many morphonological adaptations
- slati > ?slati
- šaljem > ?šaljem
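The difficulty can be illustrated with a naive suffix-stripping stemmer. This is a sketch under stated assumptions: the ending list `ENDINGS` and the minimum stem length are invented for the example and are not a real Croatian stemmer.

```python
# Illustrative ending list (an assumption), tried longest first.
ENDINGS = sorted(("ati", "ima", "em", "om", "a", "e", "i", "u"), key=len, reverse=True)

def stem(word_form, min_stem=2):
    """Strip the longest matching ending, keeping at least `min_stem` characters."""
    for ending in ENDINGS:
        if word_form.endswith(ending) and len(word_form) - len(ending) >= min_stem:
            return word_form[: len(word_form) - len(ending)]
    return word_form

for wf in ("vuk", "vuka", "vuče", "slati", "šaljem"):
    print(wf, "->", stem(wf))
```

With these endings, "vuk" and "vuka" share the stem "vuk", but "vuče" yields "vuč", and "slati" and "šaljem" end up with unrelated stems. Plain truncation cannot undo the alternations.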
12 Morphological normalization
- Croatian language (like most Slavic languages) is morphologically complex
- elaborate inflectional and derivational morphology
- problematic for most NLP applications
- requires the use of substantial linguistic knowledge
- our lexicon-based approach to normalization is somewhere in between lemmatization and stemming
- suitable for other inflectionally complex languages
13 Croatian Morphology
- high degree of affixation
- word-forms are obtained by suffixation, prefixation, phonological alternations, stem extension
- inflection
- nouns: declension (7 cases, 2 numbers)
- verbs: conjugation (tenses, persons, numbers, genders)
- adjectives: declension (7 cases, 2 numbers, 3 genders), comparison (3 degrees), and definiteness
- derivation
- a large number of rules for deriving nouns from verbs, verbs from nouns, possessive adjectives, ...
14 Croatian Morphology 2
- inflection examples
- adjective: brz, brza, brzi, brzima, brzih, brzoj, brze, brzim, brzog, brzoga, brz, brza, brzo, brzom, brzomu, brži, bržeg, brža, brži, bržima, bržih, bržoj, brže, bržim, bržem, bržima, najbrži, bržeg, najbrža, najbržima, najbržih, najbrže, najbržim, najbrži, najbržoj, ...
- noun: brzina, brzinom, brzine, brzinama, brzinu, brzina, brzini
- adjective: brzinski, brzinskom, brzinske, brzinskih, brzinska, brzinskoj, brzinsko, brzinskog, brzinskoga, ...
- adverb: brzo, brže, najbrže, brzinski
- derivation examples
- brz > brzina > brzinski > ...
15 Croatian Morphology 3
- high degree of homography
- vode: voda (water), voditi (to lead), vod (a platoon)
- requires disambiguation (POS/MSD tagging)
- affix ambiguity
- many ambiguous suffixation rules
- e.g. bolnic-a / bolnic-i vs. ruk-a / ruc-i
- e.g. bolnic-a / bolnic-om vs. brodolom / brodolom-a
- possible mismatches at inflectional level
- narančast / narančast-om vs. ruž / ruž-om (not ruža)
- possible mismatches at derivational level
- e.g. kralj / kralj-ica vs. stan / stan-ica
16 Lexicon based normalization
- lexicon-based morphological normalisation
- a morphological lexicon associates each WF with its morphological norm (lemma, stem, ...) and, optionally, an MSD
- incorporates linguistic knowledge and thus avoids the aforementioned pitfalls
- drawbacks
- made by linguists, expensive and time-consuming
- problems with coverage (neologisms, jargons, ...)
- our approach
- rule-based acquisition of large-coverage morphological lexica from raw (unannotated) corpora
17 Our approach
- acquisition of inflectional lexicon
- input: raw corpora and sets of inflectional and derivational rules in a convenient (grammar-book-like) formalism
- normalisation of word-forms
- inflectional (lemmatization)
- inflectional + derivational
- comparable to stemming (but more precise)
- advantages
- can be used as both a lemmatizer (with MSD) and a stemmer (with variable degree of conflation)
- provides good lexicon coverage
- requires only limited linguistic expertise
18 Morphology representation
- e.g. noun inflectional paradigm
- vojnik (soldier)
- Case Singular Plural
- N vojnik-Ø vojnic-i
- G vojnik-a vojnik-a
- D vojnik-u vojnic-ima
- A vojnik-a vojnik-e
- V vojnič-e vojnic-i
- L vojnik-u vojnic-ima
- I vojnik-om vojnic-ima
19 Morphology representation 2
- defines inflectional and derivational rules
- uses functions as building blocks
- A) condition functions
- B) string transformation functions
- each defined using a higher-order function
- e.g.
- sfx
- sfx('a')
- sfx('a')('vojnik') = 'vojnika'
- sfx('e') ∘ alt(pal)
- (sfx('e') ∘ alt(pal))('vojnik') = 'vojniče'
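The building blocks can be mimicked with closures. The sketch below is in Python for illustration only (the talk reports a Haskell implementation). The helper `compose` and the simplified alternation tables `pal` and `sib`, which handle only stem-final k, g, h, are assumptions of this example.

```python
from functools import reduce

def sfx(suffix):
    """Return a transformation that appends `suffix` to a stem."""
    return lambda s: s + suffix

def alt(table):
    """Return a transformation that rewrites the stem-final consonant via `table`."""
    def apply(s):
        return s[:-1] + table[s[-1]] if s and s[-1] in table else s
    return apply

def compose(*fs):
    """Right-to-left composition: compose(f, g)(x) == f(g(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), reversed(fs), x)

# Assumed, simplified alternation tables.
pal = {"k": "č", "g": "ž", "h": "š"}   # palatalization
sib = {"k": "c", "g": "z", "h": "s"}   # sibilarization

print(sfx("a")("vojnik"))                      # vojnika
print(compose(sfx("e"), alt(pal))("vojnik"))   # vojniče
```

Because every transformation is an ordinary function from string to string, new rules are obtained by composition alone, without any special-purpose rule interpreter.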
20 Morphology representation 3
- Case Singular Plural
- N vojnik-Ø vojnic-i
- G vojnik-a vojnik-a
- D vojnik-u vojnic-ima
- A vojnik-a vojnik-e
- V vojnič-e vojnic-i
- L vojnik-u vojnic-ima
- I vojnik-om vojnic-ima
- (λs. ends('k','g','h')(s) ∧ ¬consGroup(s), null, sfx(a), sfx(u), sfx(om), sfx(e) ∘ alt(pal), sfx(i) ∘ alt(sib), sfx(ima) ∘ alt(sib), sfx(e))
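Read as a condition followed by a list of transformations, the rule above can be run directly. This is a sketch in Python, not the original Haskell code. The reading of `consGroup` as "the stem ends in two consonants", the alternation tables, and the function `word_forms` are assumptions made for illustration.

```python
VOWELS = set("aeiou")

def sfx(suffix):
    return lambda s: s + suffix

def alt(table):
    # Rewrite the stem-final consonant if the table covers it.
    return lambda s: s[:-1] + table.get(s[-1], s[-1])

def compose(f, g):
    return lambda s: f(g(s))

def ends(*letters):
    return lambda s: s.endswith(letters)

def cons_group(s):
    """Assumed meaning: the stem ends in two consonants."""
    return len(s) >= 2 and s[-1] not in VOWELS and s[-2] not in VOWELS

def null(s):
    return s

pal = {"k": "č", "g": "ž", "h": "š"}   # simplified palatalization
sib = {"k": "c", "g": "z", "h": "s"}   # simplified sibilarization

def condition(s):
    return ends("k", "g", "h")(s) and not cons_group(s)

TRANSFORMATIONS = [
    null, sfx("a"), sfx("u"), sfx("om"),
    compose(sfx("e"), alt(pal)),
    compose(sfx("i"), alt(sib)),
    compose(sfx("ima"), alt(sib)),
    sfx("e"),
]

def word_forms(stem):
    """Generate the word-forms of `stem`, or [] if the paradigm does not apply."""
    if not condition(stem):
        return []
    return [t(stem) for t in TRANSFORMATIONS]

print(word_forms("vojnik"))
print(word_forms("student"))   # [] because the condition fails
```

The eight transformations yield the eight distinct strings that fill the 14 slots of the table.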
21 Morphology representation 4
- suitable also for more complex paradigms
(c, null, sfx(a), sfx(u), ..., sfx(ima)
∪ sfx(og), sfx(om), ..., sfx(ima)
∪ sfx(i) ∘ alt(jot), sfx(eg) ∘ alt(jot), ..., sfx(ima) ∘ alt(jot)
∪ sfx(i) ∘ alt(jot) ∘ pfx(naj), ..., sfx(ima) ∘ alt(jot) ∘ pfx(naj))
22 Morphology representation 5
- advantages
- resembles the morphology description found in traditional grammar books
- requires a minimum amount of linguistic knowledge
- highly expressive: arbitrary higher-order functions (HOFs) can be defined
- can be applied to other morphologically similar languages
- implemented in Haskell
- purely functional programming language
- requires minimum programming skills
23 Lexicon acquisition
- uses inflectional rules + raw corpora to extract lemmas and their paradigms
- uses frequency counts of WFs attested in the corpus
- much of the ambiguity is resolved by language-dependent heuristics
- plausibility, priority
- linguistic quality is not vital
- word-form conflation rather than generation
- human intervention is not required
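The general idea can be sketched as follows. This is a heavy simplification and not the acquisition algorithm of the talk. It uses a single toy paradigm without alternations, and a threshold on the number of attested word-forms (`min_attested`, an invented parameter) stands in for the plausibility and priority heuristics.

```python
from collections import Counter

# Toy paradigm (an assumption): endings attached directly to the lemma.
ENDINGS = ["", "a", "u", "om", "e"]

def acquire(tokens, endings=ENDINGS, min_attested=3, min_stem=3):
    """Return {lemma: {attested word-form: frequency}} for well-supported stems."""
    freq = Counter(tokens)
    lexicon = {}
    for token in freq:
        for ending in endings:
            if not token.endswith(ending):
                continue
            # Undo the ending to get a candidate stem, then regenerate its forms.
            stem = token[: len(token) - len(ending)]
            forms = {stem + e for e in endings}
            attested = {f: freq[f] for f in forms if f in freq}
            if len(stem) >= min_stem and len(attested) >= min_attested:
                lexicon[stem] = attested
    return lexicon

corpus = "vojnik vidi vojnika i daje vojniku brod a brodom plovi".split()
print(acquire(corpus))
```

On this toy corpus only "vojnik" has three attested forms. Lowering `min_attested` to 2 also admits "brod", which shows how the threshold trades coverage against the risk of spurious lemmas.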
24 Results
- example lexicon
- acquired from a 20 Mw newspaper corpus
- based on 90 inflectional and >300 derivational rules
- contains ca. 42,000 lemmas associated with over 500,000 WFs
- performance
- linguistic quality: F1 88% per type
- coverage: 96% per type and 98% per token
- understemming: 7%
- overstemming: < 4%
- can be improved further by manual editing
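The difference between per-type and per-token coverage can be made concrete with a small computation. This is a sketch on invented toy data. The function `coverage` is illustrative and is not the evaluation code behind the figures above.

```python
from collections import Counter

def coverage(tokens, lexicon_forms):
    """Return (per-type, per-token) coverage of a token list by a set of word-forms."""
    freq = Counter(tokens)
    covered = [t for t in freq if t in lexicon_forms]
    per_type = len(covered) / len(freq)                          # distinct forms
    per_token = sum(freq[t] for t in covered) / sum(freq.values())  # running text
    return per_type, per_token

tokens = ["vojnik", "vojnik", "vojnika", "blog"]
print(coverage(tokens, {"vojnik", "vojnika"}))
```

Here two of three types but three of four tokens are covered. Per-token coverage is typically higher because frequent word-forms are the ones most likely to be in the lexicon.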
25 Derivational normalization
- inflectional lexicon is partitioned into equivalence classes based on derivational rules
- degree of normalisation depends on the number of derivational rules used
- problem with semantics
- context, degrees
- derivation is not as semantically regular as inflection
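One way to picture the partitioning is as connected components over lemmas linked by derivational rules. The union-find sketch below is an assumption about how this could be done, not the method of the talk. The three rules are toy stand-ins for the derivational rule set.

```python
def derivational_classes(lemmas, rules):
    """Group lemmas connected by any derivational rule (union-find)."""
    parent = {lemma: lemma for lemma in lemmas}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for lemma in lemmas:
        for rule in rules:
            derived = rule(lemma)
            if derived in parent:           # link only lemmas present in the lexicon
                parent[find(derived)] = find(lemma)

    classes = {}
    for lemma in lemmas:
        classes.setdefault(find(lemma), set()).add(lemma)
    return sorted(sorted(c) for c in classes.values())

def noun_ina(s):       # brz > brzina
    return s + "ina"

def adj_ski(s):        # brzina > brzinski (drops a final -a)
    return (s[:-1] if s.endswith("a") else s) + "ski"

def noun_ica(s):       # kralj > kraljica
    return s + "ica"

LEMMAS = ["brz", "brzina", "brzinski", "kralj", "kraljica", "stan", "stanica"]
print(derivational_classes(LEMMAS, [noun_ina, adj_ski, noun_ica]))
print(derivational_classes(LEMMAS, [noun_ina]))
```

With all three rules, "stan" and "stanica" fall into one class even though they are semantically unrelated. With only the first rule, "brzinski" stays apart from "brz". Both the semantic problem and the dependence on the number of rules show up even at this scale.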
26 References and applications
- Reference
- Šnajder, Jan; Dalbelo Bašić, Bojana; Tadić, Marko. Automatic Acquisition of Inflectional Lexica for Morphological Normalisation // Information Processing and Management, 2008. (in press)
- Applied in document indexing
- projects AIDE, CADIAL: www.cadial.org
- Dalbelo Bašić, Bojana; Tadić, Marko; Moens, Marie-Francine. Computer Aided Document Indexing for Accessing Legislation // Toegang tot de wet / J. Van Nieuwenhove; P. Popelier (eds). Brugge: Die Keure, 2008. pp. 107-117.
- Applied in text classification
- Malenica, Mislav; Šmuc, Tomislav; Šnajder, Jan; Dalbelo Bašić, Bojana. Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus // Information Processing and Management, 44 (2008), 1; 325-339.
27 Thank you for your attention!