K.U. Leuven Leuven 20080508


1
Morphological Normalization and Collocation Extraction
  • Jan Šnajder, Bojana Dalbelo Bašić, Marko Tadić
  • University of Zagreb, Faculty of Electrical
    Engineering and Computing / Faculty of Humanities
    and Social Sciences
  • jan.snajder_at_fer.hr, bojana.dalbelo_at_fer.hr,
    marko.tadic_at_ffzg.hr
  • Seminar at the K. U. Leuven, Department of
    Computing Science, Leuven, 2008-05-08

2
Morphological Normalization
  • Jan Šnajder, Marko Tadić

3
Talk overview
  • who are we?
  • what are we doing?
  • morphological processing: normalization
  • lemmatization vs. stemming
  • Mollex: a system for normalization of Croatian
  • usage in document indexing and text
    classification
  • collocations as features
  • collocation extraction by co-occurrence measures
  • usage of genetic programming

4
Who are we?
  • University of Zagreb, Croatia
  • founded 1669, 52,500 undergraduate students
  • two faculties on the same mission:
  • to build systems that will develop and enable
    the use of language resources and tools for
    Croatian

5
Who are we? 2
  • Faculty of Humanities and Social Sciences
  • Institute / Department of Linguistics
  • dealing with basic computational linguistic tasks
    for Croatian
  • compiling and processing large-scale language
    resources
  • Croatian National Corpus, Croatian Morphological
    Lexicon, Croatian WordNet, Croatian Dependency
    Treebank
  • tagger, lemmatizer
  • chunker, parser
  • NERC system

6
Who are we? 3
  • Faculty of Electrical Engineering and Computing
  • Department of Electronics, Microelectronics,
    Computer and Intelligent Systems / KTLab
  • Knowledge Technologies Laboratory; the group deals
    with:
  • text preprocessing techniques for Croatian for
    machine learning procedures
  • dimensionality reduction and document clustering
    in the vector space model, visualisation
  • automatic indexing of documents
  • intelligent, language-specific information
    retrieval and extraction

7
What are we doing?
  • working jointly on several research projects
  • AIDE: Automatic Indexing with Descriptors from
    Eurovoc (cooperation with the Government of the
    Republic of Croatia, HIDRA)
  • RMJT: Computational Linguistic Models and
    Language Technologies for Croatian (national
    research programme, two of five projects)
  • Croatian language resources and their
    annotation, 2007-2011, prof. Marko Tadić
  • Knowledge discovery in textual data, 2007-2011,
    prof. Bojana Dalbelo Bašić
  • CADIAL: Computer Aided Document Indexing for
    Accessing Legislation
  • joint Flemish-Croatian project
  • 2007-2009
  • prof. Marie-Francine Moens, prof. Bojana Dalbelo
    Bašić

8
Morphological processing
  • computational linguistic / NLP task
  • important for inflectionally rich languages, e.g.
  • Croatian: a noun has 14 word-forms (7 cases, 2
    numbers)
  • Case  Singular    Plural
  • N     student     studenti
  • G     studenta    studenata
  • D     studentu    studentima
  • A     studenta    studente
  • V     studente    studenti
  • L     studentu    studentima
  • I     studentom   studentima
  • unlike English: a noun has 2 (3?) word-forms (2
    numbers, plus the possessive?)
  • Sg: student, Poss: (student's)
  • Pl: students
  • present in all Slavic languages (excl.
    Bulgarian), German, Greek, Baltic languages,
    Finnish, ...

9
Morphological processing 2
  • three basic subtasks in inflection processing
  • generation of (all) word-forms (WFs) of a lexeme
  • analysis of WFs, i.e. recognizing the values of
    morphosyntactic categories of a WF in text
  • recognizing which lexeme(s) a WF belongs to
  • the last one helps us avoid the problem of
    data sparseness in many text processing tasks,
    e.g.
  • information retrieval, text mining, document
    indexing
  • normalization: conflating the morphological
    variants of a word to a single representative
    form
  • two main ways to do that
  • linguistically motivated: lemmatization
  • computationally motivated: stemming
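
The effect of conflation on data sparseness can be sketched in a few lines of Python. This is only an illustration: the toy lexicon and the token list are made up for the example (the word-forms come from the "student" paradigm), and unknown tokens are simply passed through.

```python
from collections import Counter

# Hypothetical toy lexicon: word-form -> representative form (lemma).
TOY_LEXICON = {
    "student": "student", "studenta": "student", "studentu": "student",
    "studentom": "student", "studenti": "student", "studenata": "student",
    "studentima": "student", "studente": "student",
}

def normalize(tokens, lexicon):
    """Replace each token by its representative form; unknown tokens pass through."""
    return [lexicon.get(token, token) for token in tokens]

tokens = "studenti studenta studentima studentu profesor".split()

raw_counts = Counter(tokens)                            # one feature per word-form
norm_counts = Counter(normalize(tokens, TOY_LEXICON))   # one feature per lexeme

print(len(raw_counts), "features before,", len(norm_counts), "after")
```

Four sparse features with count 1 collapse into one feature with count 4, which is the point of normalization for retrieval and indexing.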

10
Morphological processing 3
  • lemmatization
  • replacing the WF with its proper base WF, usually
    called the lemma
  • e.g. mapping a theoretical maximum of 14 WFs
    to 1 lemma
  • lexicon-based
  • large lexicons of all (generated) WFs needed
  • preparation expensive in time and manpower
  • mostly realized by databases
  • algorithm-based
  • mostly FSTs (finite-state transducers): compact,
    efficient, fast
  • lexicon of lemmas and their inflectional patterns
    needed anyway

11
Morphological processing 4
  • stemming
  • reducing the WF from the end by truncating the
    possible endings
  • does not have to respect the linguistic
    boundaries
  • vukØ > vukØ
  • vuka > vuka
  • vuče > vuče
  • reducing all the WFs to a common beginning
  • problems where there are many morphonological
    adaptations
  • slati > ?slati
  • šaljem > ?šaljem
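
A minimal Python sketch of stemming by suffix truncation shows why morphonological alternations are a problem. The suffix list and the minimum stem length are assumptions chosen for this example, not a real Croatian stemmer.

```python
# Hypothetical suffix list, ordered longest first; for illustration only.
SUFFIXES = ["ima", "om", "em", "a", "e", "i", "u"]

def naive_stem(word, suffixes=SUFFIXES, min_stem=2):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)]
    return word

for word in ["vuk", "vuka", "vuče", "slati", "šaljem"]:
    print(word, ">", naive_stem(word))
```

Under these assumptions "vuk" and "vuka" share the stem "vuk", but "vuče" ends up as "vuč", and "slati" and "šaljem" (forms of the same verb) get the unrelated stems "slat" and "šalj".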

12
Morphological normalization
  • Croatian language (like most Slavic languages) is
    morphologically complex
  • elaborate inflectional and derivational
    morphology
  • problematic for most NLP applications
  • requires the use of substantial linguistic
    knowledge
  • our lexicon-based approach to normalization is
    somewhere in between lemmatization and stemming
  • suitable for other inflectionally complex
    languages

13
Croatian Morphology
  • high degree of affixation
  • word-forms are obtained by suffixation,
    prefixation, phonological alternations, stem
    extension
  • inflection
  • nouns: declension (7 cases, 2 numbers)
  • verbs: conjugation (tenses, persons, numbers,
    genders)
  • adjectives: declension (7 cases, 2 numbers, 3
    genders), comparison (3 degrees), and
    definiteness
  • derivation
  • a large number of rules for deriving nouns from
    verbs, verbs from nouns, possessive adjectives,
    ...

14
Croatian Morphology 2
  • inflection examples
  • adjective: brz, brza, brzi, brzima, brzih, brzoj,
    brze, brzim, brzog, brzoga, brz, brza, brzo,
    brzom, brzomu, brži, bržeg, brža, brži, bržima,
    bržih, bržoj, brže, bržim, bržem, bržima,
    najbrži, bržeg, najbrža, najbržima, najbržih,
    najbrže, najbržim, najbrži, najbržoj, ...
  • noun: brzina, brzinom, brzine, brzinama, brzinu,
    brzina, brzini
  • adjective: brzinski, brzinskom, brzinske,
    brzinskih, brzinska, brzinskoj, brzinsko,
    brzinskog, brzinskoga, ...
  • adverb: brzo, brže, najbrže, brzinski
  • derivation examples
  • brz > brzina > brzinski > ...

15
Croatian Morphology 3
  • high degree of homography
  • vode: voda (water), voditi (to lead), vod (a
    platoon)
  • requires disambiguation (POS/MSD tagging)
  • affix ambiguity
  • many ambiguous suffixation rules
  • e.g. bolnic-a / bolnic-i vs. ruk-a / ruc-i
  • e.g. bolnic-a / bolnic-om vs. brodolom /
    brodolom-a
  • possible mismatches at inflectional level
  • narančast / narančast-om vs. ruž / ruž-om (not
    ruža)
  • possible mismatches at derivational level
  • e.g. kralj / kralj-ica vs. stan / stan-ica
  • possible mismatches at derivational level
  • e.g. kralj / kralj-ica vs. stan / stan-ica
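
Homography can be made concrete with a word-form index that maps each form to every lexeme it may belong to. The Python sketch below uses hand-picked, partial entries for the three lexemes behind "vode"; the entry format and the POS labels are assumptions for the example.

```python
from collections import defaultdict

def build_index(entries):
    """Map each word-form to the set of (lemma, POS) pairs it may belong to."""
    index = defaultdict(set)
    for lemma, pos, forms in entries:
        for form in forms:
            index[form].add((lemma, pos))
    return index

# Hypothetical, partial entries; only a few forms per lexeme are listed.
ENTRIES = [
    ("voda", "N", ["voda", "vode", "vodi", "vodu", "vodom"]),
    ("voditi", "V", ["vodim", "vodiš", "vodi", "vode"]),
    ("vod", "N", ["vod", "voda", "vodu", "vode", "vodom"]),
]

index = build_index(ENTRIES)
print(sorted(index["vode"]))
```

A lookup for "vode" returns three candidate lexemes, so a lexicon alone cannot pick one; choosing among them is the job of POS/MSD tagging.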

16
Lexicon based normalization
  • lexicon-based morphological normalisation
  • a morphological lexicon associates each WF with
    its morphological norm (lemma, stem, ...) and,
    optionally, an MSD
  • incorporates linguistic knowledge and thus avoids
    the aforementioned pitfalls
  • drawbacks
  • made by linguists, expensive and time-consuming
  • problems with coverage (neologisms, jargon, ...)
  • our approach
  • rule-based acquisition of large coverage
    morphological lexica from raw (unannotated)
    corpora

17
Our approach
  • acquisition of inflectional lexicon
  • input: raw corpora and sets of inflectional and
    derivational rules in a convenient
    (grammar-book-like) formalism
  • normalisation of word-forms
  • inflectional (lemmatization)
  • inflectional + derivational
  • comparable to stemming (but more precise)
  • advantages
  • can be used as both a lemmatizer (with MSD) and a
    stemmer (with variable degree of conflation)
  • provides good lexicon coverage
  • requires only limited linguistic expertise

18
Morphology representation
  • e.g. noun inflectional paradigm
  • vojnik (soldier)
  • Case  Singular    Plural
  • N     vojnik-Ø    vojnic-i
  • G     vojnik-a    vojnik-a
  • D     vojnik-u    vojnic-ima
  • A     vojnik-a    vojnik-e
  • V     vojnič-e    vojnic-i
  • L     vojnik-u    vojnic-ima
  • I     vojnik-om   vojnic-ima

19
Morphology representation 2
  • defines inflectional and derivational rules
  • uses functions as building blocks
  • A) condition functions
  • B) string transformation functions
  • each defined using a higher-order function
  • e.g.
  • sfx
  • sfx('a')
  • sfx('a')('vojnik') = 'vojnika'
  • sfx('e') ∘ alt(pal)
  • (sfx('e') ∘ alt(pal))('vojnik') = 'vojniče'
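
The building blocks above translate almost literally into Python closures. This is a sketch, not the authors' Haskell code: `compose` stands in for the composition operator, and the alternation tables `pal` and `sib` are assumed to cover only stem-final k, g, h.

```python
def sfx(suffix):
    """Return a transformation that appends `suffix` to a stem."""
    return lambda s: s + suffix

def alt(table):
    """Return a transformation that rewrites the stem-final letter using `table`."""
    def apply(s):
        if s and s[-1] in table:
            return s[:-1] + table[s[-1]]
        return s
    return apply

def compose(f, g):
    """Function composition: compose(f, g)(s) == f(g(s))."""
    return lambda s: f(g(s))

# Assumed alternation tables: palatalization and sibilarization of k, g, h.
pal = {"k": "č", "g": "ž", "h": "š"}
sib = {"k": "c", "g": "z", "h": "s"}

print(sfx("a")("vojnik"))                        # vojnika
print(compose(sfx("e"), alt(pal))("vojnik"))     # vojniče
print(compose(sfx("i"), alt(sib))("vojnik"))     # vojnici
```

Because `sfx` and `alt` return functions, a rule such as `sfx('e') ∘ alt(pal)` is itself an ordinary string transformation that can be stored in a list and applied later.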

20
Morphology representation 3
  • Case  Singular    Plural
  • N     vojnik-Ø    vojnic-i
  • G     vojnik-a    vojnik-a
  • D     vojnik-u    vojnic-ima
  • A     vojnik-a    vojnik-e
  • V     vojnič-e    vojnic-i
  • L     vojnik-u    vojnic-ima
  • I     vojnik-om   vojnic-ima
  • (λs. ends('k','g','h')(s) ∧ ¬consGroup(s),
    null, sfx('a'), sfx('u'), sfx('om'), sfx('e') ∘
    alt(pal), sfx('i') ∘ alt(sib), sfx('ima') ∘
    alt(sib), sfx('e'))
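
Read as a pair of (condition, list of transformations), the rule above can generate the whole paradigm. The Python sketch below makes several assumptions: `cons_group` is taken to mean "the stem ends in two consonants", the vowel set is a, e, i, o, u, and `generate` is a helper name invented for the example.

```python
def sfx(suffix):
    return lambda s: s + suffix

def alt(table):
    def apply(s):
        return s[:-1] + table[s[-1]] if s and s[-1] in table else s
    return apply

def compose(f, g):
    return lambda s: f(g(s))

def null(s):
    """Identity transformation (the zero ending)."""
    return s

def ends(*letters):
    """Condition: the stem ends in one of `letters`."""
    return lambda s: s.endswith(letters)

VOWELS = set("aeiou")

def cons_group(s):
    """Assumed meaning of consGroup: the stem ends in two consonants."""
    return len(s) >= 2 and s[-1] not in VOWELS and s[-2] not in VOWELS

PAL = {"k": "č", "g": "ž", "h": "š"}
SIB = {"k": "c", "g": "z", "h": "s"}

# The rule from the slide as (condition, transformations).
RULE = (
    lambda s: ends("k", "g", "h")(s) and not cons_group(s),
    [null, sfx("a"), sfx("u"), sfx("om"), compose(sfx("e"), alt(PAL)),
     compose(sfx("i"), alt(SIB)), compose(sfx("ima"), alt(SIB)), sfx("e")],
)

def generate(lemma, rule):
    """Return all word-forms of `lemma` under `rule`, or an empty set if it does not apply."""
    condition, transformations = rule
    if not condition(lemma):
        return set()
    return {transform(lemma) for transform in transformations}

print(sorted(generate("vojnik", RULE)))
```

For "vojnik" this yields the eight distinct forms of the table; for a lemma such as "student" the condition fails and nothing is generated.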

21
Morphology representation 4
  • suitable also for more complex paradigms

(c, null, sfx('a'), sfx('u'), ..., sfx('ima')
 ∪ sfx('og'), sfx('om'), ..., sfx('ima')
 ∪ sfx('i') ∘ alt(jot), sfx('eg') ∘ alt(jot),
   ..., sfx('ima') ∘ alt(jot)
 ∪ sfx('i') ∘ alt(jot) ∘ pfx('naj'), ...,
   sfx('ima') ∘ alt(jot) ∘ pfx('naj'))

22
Morphology representation 5
  • advantages
  • resembles morphology descriptions as found in
    traditional grammar books
  • requires a minimum amount of linguistic knowledge
  • highly expressive: arbitrary higher-order
    functions (HOFs) can be defined
  • can be applied to other morphologically similar
    languages
  • implemented in Haskell
  • purely functional programming language
  • requires minimum programming skills

23
Lexicon acquisition
  • uses inflectional rules and raw corpora to extract
    lemmas and their paradigms
  • uses frequency counts of WFs attested in the
    corpus
  • much of the ambiguity is resolved by
    language-dependent heuristics
  • plausibility, priority
  • linguistic quality is not vital
  • word-form conflation rather than generation
  • human intervention is not required
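
One way to picture the acquisition step is the Python sketch below. It is not the authors' algorithm: the two suffix-only paradigms, the scoring by the share of attested forms, and the "lemma must be attested" check are all assumptions made for the example, and the toy corpus reuses the bolnica / brodolom ambiguity mentioned earlier.

```python
from collections import Counter

# Hypothetical suffix-only paradigms: name -> (lemma suffix, all suffixes).
PARADIGMS = {
    "masc-noun": ("", ["", "a", "u", "om", "e", "i", "ima"]),
    "fem-a-noun": ("a", ["a", "e", "i", "u", "om", "ama"]),
}

def candidates(word_form):
    """Yield (lemma, paradigm, forms) hypotheses that could produce word_form."""
    for name, (lemma_suffix, suffixes) in PARADIGMS.items():
        for suffix in suffixes:
            if word_form.endswith(suffix) and len(word_form) > len(suffix):
                stem = word_form[: len(word_form) - len(suffix)]
                yield stem + lemma_suffix, name, {stem + x for x in suffixes}

def acquire(freq):
    """For each word-form keep the hypothesis whose forms are best attested in freq."""
    lexicon = {}
    for word_form in freq:
        best = None
        for lemma, name, forms in candidates(word_form):
            if lemma not in freq:      # assumed plausibility check: lemma is attested
                continue
            score = sum(1 for form in forms if form in freq) / len(forms)
            if best is None or score > best[0]:
                best = (score, lemma, name)
        if best is not None:
            lexicon[word_form] = (best[1], best[2])
    return lexicon

corpus = "bolnica bolnice bolnicu bolnicom brodolom brodoloma brodolomu".split()
lexicon = acquire(Counter(corpus))
print(lexicon["bolnicom"], lexicon["brodoloma"])
```

On this toy corpus "bolnicom" is assigned to the lemma "bolnica" and "brodoloma" to "brodolom", although both word-forms match both paradigms on the surface; the attested forms decide.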

24
Results
  • example lexicon
  • acquired from a 20 Mw newspaper corpus
  • based on 90 inflectional and >300 derivational
    rules
  • contains ca. 42,000 lemmas associated with over
    500,000 WFs
  • performance
  • linguistic quality: F1 88% per type
  • coverage: 96% per type and 98% per token
  • understemming: 7%
  • overstemming: < 4%
  • can be improved further by manual editing
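
The difference between per-type and per-token coverage can be shown with a short Python sketch. The token list and the lexicon below are invented for the example; "xyz" stands for any word-form missing from the lexicon.

```python
from collections import Counter

def coverage(tokens, lexicon):
    """Return (per-type, per-token) coverage of `lexicon` over `tokens`."""
    counts = Counter(tokens)
    covered_types = [word for word in counts if word in lexicon]
    per_type = len(covered_types) / len(counts)
    per_token = sum(counts[word] for word in covered_types) / len(tokens)
    return per_type, per_token

# Hypothetical data: two frequent covered forms and one rare unknown form.
tokens = ["vode", "vode", "vode", "brzina", "brzina", "xyz"]
lexicon = {"vode", "brzina"}

per_type, per_token = coverage(tokens, lexicon)
print(f"per type: {per_type:.2f}, per token: {per_token:.2f}")
```

Per-token coverage is higher here because the uncovered form is rare, which is the usual reason token coverage exceeds type coverage.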

25
Derivational normalization
  • inflectional lexicon is partitioned into
    equivalence classes based on derivational rules
  • degree of normalisation depends on the number of
    derivational rules used
  • problem with semantics
  • context, degrees
  • derivation is not as semantically regular as
    inflection
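
Partitioning lemmas into derivational classes can be sketched in Python with a small union-find. The three rules are simplified, hypothetical suffixation rules, and the lemma list reuses the examples from the earlier slides.

```python
def derivational_classes(lemmas, rules):
    """Partition `lemmas` into classes linked by any of the derivational `rules`."""
    parent = {lemma: lemma for lemma in lemmas}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for lemma in lemmas:
        for rule in rules:
            derived = rule(lemma)
            if derived in parent:           # link only lemmas present in the lexicon
                parent[find(derived)] = find(lemma)

    classes = {}
    for lemma in lemmas:
        classes.setdefault(find(lemma), set()).add(lemma)
    return sorted(sorted(members) for members in classes.values())

# Hypothetical derivational rules, simplified to suffixation.
def add_ina(s):
    return s + "ina"

def add_ski(s):
    return (s[:-1] if s.endswith("a") else s) + "ski"

def add_ica(s):
    return s + "ica"

LEMMAS = ["brz", "brzina", "brzinski", "kralj", "kraljica", "stan", "stanica"]

print(derivational_classes(LEMMAS, [add_ina, add_ski]))
print(derivational_classes(LEMMAS, [add_ina, add_ski, add_ica]))
```

With two rules only brz, brzina and brzinski are merged; adding the -ica rule also merges kralj with kraljica and, less desirably, stan with stanica, which illustrates both the variable degree of conflation and the problem with semantics.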

26
References and applications
  • Reference
  • Šnajder, Jan; Dalbelo Bašić, Bojana; Tadić,
    Marko. Automatic Acquisition of Inflectional
    Lexica for Morphological Normalisation //
    Information Processing and Management, 2008. (in
    press)
  • Applied in document indexing
  • projects: AIDE, CADIAL (www.cadial.org)
  • Dalbelo Bašić, Bojana; Tadić, Marko; Moens,
    Marie-Francine. Computer Aided Document Indexing
    for Accessing Legislation // Toegang tot de wet /
    J. Van Nieuwenhove, P. Popelier (eds). Brugge:
    Die Keure, 2008. pp. 107-117.
  • Applied in text classification
  • Malenica, Mislav; Šmuc, Tomislav; Šnajder, Jan;
    Dalbelo Bašić, Bojana. Language Morphology
    Offset: Text Classification on a Croatian-English
    Parallel Corpus // Information Processing and
    Management, 44 (2008), 1; 325-339.

27
  • Thank you for your attention!