K.U. Leuven Leuven 20080508

1
Morphological Normalization and Collocation Extraction

Jan Šnajder, Bojana Dalbelo Bašić, Marko Tadić
University of Zagreb, Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences
jan.snajder_at_fer.hr, bojana.dalbelo_at_fer.hr, marko.tadic_at_ffzg.hr
Seminar at the K. U. Leuven, Department of Computing Science, Leuven, 2008-05-08

2
Morphological Normalization

Jan Šnajder, Marko Tadić

3
Talk overview

who are we?
what are we doing?
morphological processing: normalization
lemmatization vs. stemming
Mollex: a system for normalization of Croatian
usage in document indexing and text classification
collocations as features
collocation extraction by co-occurrence measures
usage of genetic programming

4
Who are we?

University of Zagreb, Croatia
founded 1669, 52,500 undergraduate students
two faculties on the same mission:
build the systems that will develop and enable the usage of language resources and tools for Croatian

5
Who are we? 2

Faculty of Humanities and Social Sciences
Institute / Department of Linguistics
dealing with basic computational linguistic tasks for Croatian
compiling and processing large-scale language resources
Croatian National Corpus, Croatian Morphological Lexicon, Croatian WordNet, Croatian Dependency Treebank
tagger, lemmatizer
chunker, parser
NERC system

6
Who are we? 3

Faculty of Electrical Engineering and Computing
Department of Electronics, Microelectronics, Computer and Intelligent Systems / KTLab (Knowledge Technologies Laboratory)
the group deals with:
text preprocessing techniques for Croatian for machine learning procedures
dimensionality reduction and document clustering in the vector space model, visualisation
automatic indexing of documents
intelligent, language-specific information retrieval and extraction

7
What are we doing?

working jointly on several research projects
AIDE: Automatic Indexing with Descriptors from Eurovoc (cooperation with the Government of the Republic of Croatia, HIDRA)
RMJT: Computational Linguistic Models and Language Technologies for Croatian (national research programme, two of five projects)
Croatian language resources and their annotation, 2007-2011, prof. Marko Tadić
Knowledge discovery in textual data, 2007-2011, prof. Bojana Dalbelo Bašić
CADIAL: Computer Aided Document Indexing for Accessing Legislation
joint Flemish-Croatian project
2007-2009
prof. Marie-Francine Moens, prof. Bojana Dalbelo Bašić

8
Morphological processing

computational linguistic / NLP task
important for inflectionally rich languages, e.g. a Croatian noun has 14 word-forms (7 cases, 2 numbers)
Case  Singular   Plural
N     student    studenti
G     studenta   studenata
D     studentu   studentima
A     studenta   studente
V     studentu   studenti
L     studentu   studentima
I     studentom  studentima
unlike an English noun, which has 2 (3?) word-forms (2 numbers, possessive?)
Sg student, Poss (student's)
Pl students
present in all Slavic languages (excl. Bulgarian), German, Greek, Baltic languages, Finnish, ...

9
Morphological processing 2

three basic subtasks in inflection processing
generation of (all) word-forms (WFs) of a lexeme
analysis of WFs, i.e. recognizing the values of the morphosyntactic categories of a WF in text
recognizing which lexeme(s) a WF belongs to
the last one helps us avoid the problem of data sparseness in many text processing tasks, e.g. information retrieval, text mining, document indexing
normalization: conflating the morphological variants of a word to a single representative form
two main ways to do that
linguistically motivated: lemmatization
computationally motivated: stemming

10
Morphological processing 3

lemmatization
replacing the WF with its proper base WF, usually called the lemma
e.g. mapping the theoretical maximum of (e.g. 14) WFs to 1 lemma
lexicon-based
large lexicons of all (generated) WFs needed
preparation expensive in time and manpower
mostly realized by databases
algorithm-based
mostly FSTs: compact, efficient, fast
lexicon of lemmas and their inflectional patterns needed anyway
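
A minimal sketch of the lexicon-based variant, assuming the lexicon is simply a table from every word-form to its lemma. The names LEXICON, lemmatize and normalize are hypothetical, and the forms are the ones listed for "student" earlier in the talk; a real lexicon would hold hundreds of thousands of entries.

```python
# Hypothetical toy lexicon: every distinct word-form (WF) of one lexeme
# mapped to its lemma. The 14 case/number slots of "student" collapse
# to 8 distinct strings.
STUDENT_FORMS = [
    "student", "studenta", "studentu", "studentom",
    "studenti", "studenata", "studentima", "studente",
]

LEXICON = {wf: "student" for wf in STUDENT_FORMS}


def lemmatize(word_form, lexicon=LEXICON):
    """Return the lemma of a known WF, or the WF unchanged if it is not covered."""
    return lexicon.get(word_form.lower(), word_form)


def normalize(tokens, lexicon=LEXICON):
    """Lemmatize every token of an already tokenized text."""
    return [lemmatize(token, lexicon) for token in tokens]
```

Calling normalize(["Studenti", "studentima", "knjiga"]) maps the first two tokens to "student" and leaves "knjiga" untouched, which is the coverage problem in miniature: a WF missing from the lexicon is not normalized at all.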

11
Morphological processing 4

stemming
reducing the WF from the end by truncating the possible endings
does not have to respect linguistic boundaries
vukØ > vukØ
vuka > vuka
vuče > vuče
reducing all the WFs to a common beginning
problems where there are many morphonological adaptations
slati > ?slati
šaljem > ?šaljem
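
A minimal sketch of ending truncation, assuming a small hand-picked ending list and a longest-match strategy. ENDINGS, stem and min_stem are hypothetical names, and this is a generic suffix stripper, not a stemmer described in the talk.

```python
# Hypothetical ending list, sorted longest first so that "ima" is tried
# before "a".
ENDINGS = sorted(["a", "e", "i", "u", "om", "ima", "em", "ti"], key=len, reverse=True)


def stem(word_form, endings=ENDINGS, min_stem=2):
    """Strip the longest matching ending, keeping at least `min_stem` characters."""
    for ending in endings:
        if word_form.endswith(ending) and len(word_form) - len(ending) >= min_stem:
            return word_form[: len(word_form) - len(ending)]
    return word_form
```

With this list, "vuk" and "vuka" both reduce to "vuk", but "vuče" reduces to "vuč", and "slati" and "šaljem" reduce to "sla" and "šalj". Plain truncation therefore fails to conflate forms that differ by a phonological alternation.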

12
Morphological normalization

Croatian (like most Slavic languages) is morphologically complex
elaborate inflectional and derivational morphology
problematic for most NLP applications
requires the use of substantial linguistic knowledge
our lexicon-based approach to normalization is somewhere in between lemmatization and stemming
suitable for other inflectionally complex languages

13
Croatian Morphology

high degree of affixation
word-forms are obtained by suffixation,
prefixation, phonological alternations, stem
extension
inflection
nouns: declension (7 cases, 2 numbers)
verbs: conjugation (tenses, persons, numbers, genders)
adjectives: declension (7 cases, 2 numbers, 3 genders), comparison (3 degrees), and definiteness
derivation
a large number of rules for deriving nouns from verbs, verbs from nouns, possessive adjectives, ...

14
Croatian Morphology 2

inflection examples
adjective: brz, brza, brzi, brzima, brzih, brzoj, brze, brzim, brzog, brzoga, brz, brza, brzo, brzom, brzomu, brži, bržeg, brža, brži, bržima, bržih, bržoj, brže, bržim, bržem, bržima, najbrži, bržeg, najbrža, najbržima, najbržih, najbrže, najbržim, najbrži, najbržoj, ...
noun: brzina, brzinom, brzine, brzinama, brzinu, brzina, brzini
adjective: brzinski, brzinskom, brzinske, brzinskih, brzinska, brzinskoj, brzinsko, brzinskog, brzinskoga, ...
adverb: brzo, brže, najbrže, brzinski
derivation examples
brz > brzina > brzinski > ...

15
Croatian Morphology 3

high degree of homography
vode: voda (water), voditi (to lead), vod (a platoon)
requires disambiguation (POS/MSD tagging)
affix ambiguity
many ambiguous suffixation rules
e.g. bolnic-a / bolnic-i vs. ruk-a / ruc-i
e.g. bolnic-a / bolnic-om vs. brodolom / brodolom-a
possible mismatches at inflectional level
narančast / narančast-om vs. ruž / ruž-om (not ruža)
possible mismatches at derivational level
e.g. kralj / kralj-ica vs. stan / stan-ica

16
Lexicon based normalization

lexicon-based morphological normalisation
a morphological lexicon associates each WF with its morphological norm (lemma, stem, ...) and, optionally, an MSD
incorporates linguistic knowledge and thus avoids the aforementioned pitfalls
drawbacks
made by linguists, expensive and time-consuming
problems with coverage (neologisms, jargon, ...)
our approach
rule-based acquisition of large-coverage morphological lexica from raw (unannotated) corpora

17
Our approach

acquisition of inflectional lexicon
input: raw corpora and sets of inflectional and derivational rules in a convenient (grammar-book-like) formalism
normalisation of word-forms
inflectional (lemmatization)
inflectional + derivational
comparable to stemming (but more precise)
advantages
can be used as both a lemmatizer (with MSD) and a stemmer (with variable degree of conflation)
provides good lexicon coverage
requires only limited linguistic expertise

18
Morphology representation

e.g. noun inflectional paradigm
vojnik (soldier)
Case  Singular   Plural
N     vojnik-Ø   vojnic-i
G     vojnik-a   vojnik-a
D     vojnik-u   vojnic-ima
A     vojnik-a   vojnik-e
V     vojnič-e   vojnic-i
L     vojnik-u   vojnic-ima
I     vojnik-om  vojnic-ima

19
Morphology representation 2

defines inflectional and derivational rules
uses functions as building blocks
A) condition functions
B) string transformation functions
each defined using a higher-order function
e.g.
sfx
sfx('a')
sfx('a')('vojnik') = 'vojnika'
sfx('e') ∘ alt(pal)
(sfx('e') ∘ alt(pal))('vojnik') = 'vojniče'
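
The building blocks above can be sketched as ordinary closures. Python is used here purely for illustration. The names sfx, alt and pal come from the slide; compose, the sib table and the dictionary encoding of the alternations are assumptions (standard Croatian palatalization and sibilarization of stem-final k, g, h).

```python
def sfx(suffix):
    """Return a transformation that appends `suffix` to a stem."""
    return lambda s: s + suffix


def alt(table):
    """Return a transformation that rewrites the stem-final consonant per `table`."""
    def apply(s):
        return s[:-1] + table[s[-1]] if s and s[-1] in table else s
    return apply


def compose(*fs):
    """Right-to-left composition: compose(f, g)(x) == f(g(x))."""
    def composed(x):
        for f in reversed(fs):
            x = f(x)
        return x
    return composed


# Assumed alternation tables for stem-final consonants.
pal = {"k": "č", "g": "ž", "h": "š"}  # palatalization
sib = {"k": "c", "g": "z", "h": "s"}  # sibilarization
```

Here sfx('a')('vojnik') gives 'vojnika' and compose(sfx('e'), alt(pal))('vojnik') gives 'vojniče'. The alternation has to run before the suffix is attached, which is why the composition is applied right to left.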

20
Morphology representation 3

Case  Singular   Plural
N     vojnik-Ø   vojnic-i
G     vojnik-a   vojnik-a
D     vojnik-u   vojnic-ima
A     vojnik-a   vojnik-e
V     vojnič-e   vojnic-i
L     vojnik-u   vojnic-ima
I     vojnik-om  vojnic-ima
(λs. ends('k','g','h')(s) ∧ ¬consGroup(s), null, sfx(a), sfx(u), sfx(om), sfx(e) ∘ alt(pal), sfx(i) ∘ alt(sib), sfx(ima) ∘ alt(sib), sfx(e))
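
A hedged sketch of how such a rule could be applied to a stem, again in Python for illustration only. Reading the condition as "the stem ends in k, g or h and does not end in a consonant group" is an assumption, as are the helper names cons_group, compose and word_forms and the simplified definition of a consonant group.

```python
VOWELS = set("aeiou")

# Assumed alternation tables for stem-final consonants.
pal = {"k": "č", "g": "ž", "h": "š"}
sib = {"k": "c", "g": "z", "h": "s"}


def sfx(suffix):
    return lambda s: s + suffix


def alt(table):
    return lambda s: s[:-1] + table[s[-1]] if s and s[-1] in table else s


def compose(f, g):
    return lambda s: f(g(s))


def ends(*letters):
    """Condition: the stem ends in one of `letters`."""
    return lambda s: s.endswith(letters)


def cons_group(s):
    """Simplifying assumption: a consonant group is two stem-final non-vowels."""
    return len(s) >= 2 and s[-1] not in VOWELS and s[-2] not in VOWELS


null = lambda s: s  # the zero ending

# The rule as a (condition, transformations) pair.
NOUN_RULE = (
    lambda s: ends("k", "g", "h")(s) and not cons_group(s),
    [null, sfx("a"), sfx("u"), sfx("om"),
     compose(sfx("e"), alt(pal)), compose(sfx("i"), alt(sib)),
     compose(sfx("ima"), alt(sib)), sfx("e")],
)


def word_forms(stem, rule):
    """Return the set of WFs the rule generates, or None if the condition fails."""
    condition, transformations = rule
    if not condition(stem):
        return None
    return {t(stem) for t in transformations}
```

For the stem "vojnik" the eight transformations yield the eight distinct strings of the table, while a stem such as "student" is rejected by the condition.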

21
Morphology representation 4

suitable also for more complex paradigms

(c, null, sfx(a), sfx(u), ..., sfx(ima)
∪ sfx(og), sfx(om), ..., sfx(ima)
∪ sfx(i) ∘ alt(jot), sfx(eg) ∘ alt(jot), ..., sfx(ima) ∘ alt(jot)
∪ sfx(i) ∘ alt(jot) ∘ pfx(naj), ..., sfx(ima) ∘ alt(jot) ∘ pfx(naj))

22
Morphology representation 5

advantages
resembles the morphology descriptions found in traditional grammar books
requires a minimum amount of linguistic knowledge
highly expressive: arbitrary higher-order functions can be defined
can be applied to other morphologically similar languages
implemented in Haskell
purely functional programming language
requires minimum programming skills

23
Lexicon acquisition

uses inflectional rules and raw corpora to extract lemmas and their paradigms
uses frequency counts of WFs attested in the corpus
much of the ambiguity is resolved by language-dependent heuristics
plausibility, priority
linguistic quality is not vital
word-form conflation rather than generation
human intervention is not required
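
A rough sketch of the idea, not the authors' procedure. Paradigms are reduced to plain suffix lists, and the talk's heuristics are replaced by one stand-in score: the share of a candidate paradigm's forms that are attested in the corpus. PARADIGMS, candidates, acquire and min_attested are all hypothetical.

```python
from collections import Counter

# Hypothetical, heavily simplified paradigms: plain suffix lists.
# The first suffix of each list forms the lemma.
PARADIGMS = {
    "noun-masc": ["", "a", "u", "om", "i", "e", "ima"],
    "noun-fem": ["a", "e", "i", "u", "om", "ama"],
}


def candidates(word_form, paradigms=PARADIGMS):
    """Yield every (stem, paradigm name) pair that could have produced `word_form`."""
    for name, suffixes in paradigms.items():
        for suffix in suffixes:
            if word_form.endswith(suffix) and len(word_form) > len(suffix):
                yield word_form[: len(word_form) - len(suffix)], name


def acquire(freq, paradigms=PARADIGMS, min_attested=3):
    """Map each corpus WF to the lemma of its best-attested candidate paradigm."""
    lexicon = {}
    for wf in freq:
        best_score, best_lemma = 0.0, None
        for stem, name in candidates(wf, paradigms):
            forms = {stem + s for s in paradigms[name]}
            attested = sum(1 for f in forms if f in freq)
            score = attested / len(forms)  # stand-in plausibility score
            if attested >= min_attested and score > best_score:
                best_score, best_lemma = score, stem + paradigms[name][0]
        if best_lemma is not None:
            lexicon[wf] = best_lemma
    return lexicon


corpus = "brzina brzine brzini brzinu brzinom student studenta studentu studenti"
freq = Counter(corpus.split())
lexicon = acquire(freq)
```

On this toy corpus "brzinom" is mapped to "brzina" and "studenta" to "student". Both paradigms match five attested forms for the stem "brzin", and only the ratio separates them, which shows on a small scale why some ranking heuristic is needed.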

24
Results

example lexicon
acquired from a 20 Mw newspaper corpus
based on 90 inflectional and >300 derivational rules
contains ca. 42,000 lemmas associated with over 500,000 WFs
performance
linguistic quality: F1 88% per type
coverage: 96% per type and 98% per token
understemming: 7%
overstemming: < 4%
can be improved further by manual editing
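
The difference between the per-type and per-token figures can be made concrete with a small sketch. It assumes the usual definitions (share of distinct WFs covered vs. share of running tokens covered); the talk does not say how its figures were computed, and the function name coverage is hypothetical.

```python
from collections import Counter


def coverage(tokens, lexicon):
    """Return (per-type, per-token) coverage of `tokens` by the WFs in `lexicon`."""
    counts = Counter(tokens)
    covered_types = [wf for wf in counts if wf in lexicon]
    per_type = len(covered_types) / len(counts)
    per_token = sum(counts[wf] for wf in covered_types) / len(tokens)
    return per_type, per_token


# Toy data: one rare WF is missing from the lexicon.
tokens = ["vojnik", "vojnika", "vojnik", "vojnik", "blog"]
per_type, per_token = coverage(tokens, {"vojnik", "vojnika"})
```

Here two of three types but four of five tokens are covered. Frequent WFs are more likely to be in the lexicon, so per-token coverage is typically the higher of the two.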

25
Derivational normalization

inflectional lexicon is partitioned into
equivalence classes based on derivational rules

degree of normalisation depends on the number of
derivational rules used
problem with semantics
context, degrees
derivation is not as semantically regular as inflection
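
One way to picture the partitioning is a union-find over the lemmas of the lexicon, linking two lemmas whenever a derivational rule maps one onto the other. This is a sketch under stated assumptions: the rule encoding as (old suffix, new suffix) pairs and the names DERIVATION_RULES, derive and derivational_classes are hypothetical.

```python
# Hypothetical derivational rules as (suffix of base lemma, suffix of derived lemma).
DERIVATION_RULES = [
    ("", "ina"),   # adjective -> noun:  brz -> brzina
    ("a", "ski"),  # noun -> adjective:  brzina -> brzinski
]


def derive(lemma, rule):
    """Apply one rule to a lemma, or return None if the rule does not match."""
    old, new = rule
    if not lemma.endswith(old):
        return None
    return lemma[: len(lemma) - len(old)] + new


def derivational_classes(lemmas, rules=DERIVATION_RULES):
    """Partition `lemmas` into classes linked by any of the derivational `rules`."""
    parent = {lemma: lemma for lemma in lemmas}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for lemma in lemmas:
        for rule in rules:
            derived = derive(lemma, rule)
            if derived in parent:  # only link lemmas present in the lexicon
                parent[find(derived)] = find(lemma)

    classes = {}
    for lemma in lemmas:
        classes.setdefault(find(lemma), set()).add(lemma)
    return list(classes.values())
```

With both rules, brz, brzina and brzinski fall into one class. With only the first rule, brzinski stays on its own, which illustrates how the degree of normalisation follows the number of rules used. A purely formal rule would also link pairs such as stan / stanica, which is the semantic problem noted above.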

26
References and applications

Reference
Šnajder, Jan; Dalbelo Bašić, Bojana; Tadić, Marko. Automatic Acquisition of Inflectional Lexica for Morphological Normalisation // Information Processing and Management, 2008. (in press)
Applied in document indexing
projects AIDE and CADIAL, www.cadial.org
Dalbelo Bašić, Bojana; Tadić, Marko; Moens, Marie-Francine. Computer Aided Document Indexing for Accessing Legislation // Toegang tot de wet / J. Van Nieuwenhove, P. Popelier (eds). Brugge: Die Keure, 2008. pp. 107-117.
Applied in text classification
Malenica, Mislav; Šmuc, Tomislav; Šnajder, Jan; Dalbelo Bašić, Bojana. Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus // Information Processing and Management, 44 (2008), 1; 325-339.