The Syntax-Morphology Interface and Natural Language Processing

1
The Syntax-Morphology Interface and Natural
Language Processing
  • Veronika Vincze
  • University of Szeged
  • Hungary
  • vinczev_at_inf.u-szeged.hu

Thematic Training Course on Processing
Morphologically Rich Languages 11-15 April 2011
2
Outline
  • Introduction
  • Syntax vs. morphology from a linguistic viewpoint
  • Morphological coding systems in Hungarian
  • Morphosyntactic information in Hungarian corpora
  • Language-specific morphosyntactic problems
  • Effects on IE, NER and MT

3
Syntax vs. morphology
  • Typological differences among languages
  • Agglutinative languages: the role of morphology is stronger
    (a lot of information in morphemes)
  • Isolating languages: the role of syntax is stronger (fewer
    morphemes, more constructions)
  • Focus on Hungarian (agglutinative) and English
    (fusional/isolating)

4
Basic Hungarian syntax
  • A lot of information is encoded in morphemes
  • No fixed word order
  • Information structure is reflected in word order
    (theme-rheme, old-new)
  • Péter szereti Marit. (Peter love-3SgObj Mary-ACC)
    'Peter loves Mary.'
  • Péter Marit szereti. 'It is Mary who Peter
    loves.'
  • Marit szereti Péter. 'It is Mary who Peter
    loves.'
  • Marit Péter szereti. 'It is Peter who loves
    Mary.'
  • Szereti Péter Marit. 'Peter LOVES Mary (and not
    hates).'
  • Szereti Marit Péter. 'Peter LOVES Mary (and not
    hates).'
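
The point of the six orderings above can be sketched in a few lines of Python: since case is marked on the word itself, subject and object can be read off the morphological labels in any order. The toy lexicon and the function name grammatical_roles are assumptions made for this illustration; a real system would get the labels from a morphological analyzer, and nominative nouns are not always subjects.

```python
from itertools import permutations

# Toy lexicon (an assumption for illustration): form -> (lemma, label),
# with the labels taken from the glosses on this slide.
LEXICON = {
    "Péter": ("Péter", "NOM"),
    "Marit": ("Mari", "ACC"),
    "szereti": ("szeret", "V"),
}

def grammatical_roles(tokens):
    """Read subject, object and verb off the case labels, ignoring position."""
    roles = {}
    for token in tokens:
        lemma, label = LEXICON[token]
        if label == "NOM":
            roles["subject"] = lemma
        elif label == "ACC":
            roles["object"] = lemma
        elif label == "V":
            roles["verb"] = lemma
    return roles

# Every word order yields the same grammatical relations.
for order in permutations(["Péter", "szereti", "Marit"]):
    print(" ".join(order), "->", grammatical_roles(order))
```

What changes between the orderings is the information structure (focus, emphasis), which this sketch deliberately does not model.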

5
Morphosyntactic features of Hungarian
  • Nominal declension (nouns, adjectives, numerals)
  • Verbal conjugation
  • Several hundred word forms for each lemma
  • Grammatical relations are encoded primarily by
    morphemes -> morphosyntactic

6
Nominal suffixes
  • A stem can be extended by:
      • Derivational suffixes
      • Plural
      • Possessive
      • Case suffixes
  • hat-ás-a-i-nak 'to its effects'
    stem-DERIV.SUFF-POSS-POSS.PL-DAT
  • egész-ség-ed-re 'cheers' (literally 'to your health')
    stem-DERIV.SUFF-POSS.Sg2-SUB
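
As a small illustration of how the segmented forms and their gloss lines correspond, the following sketch pairs each morpheme with its label. The helper name pair_gloss is hypothetical; it only relies on the hyphen convention used on this slide.

```python
def pair_gloss(segmented, gloss):
    """Pair each hyphen-separated morpheme with its gloss label."""
    morphemes = segmented.split("-")
    labels = gloss.split("-")
    if len(morphemes) != len(labels):
        raise ValueError("segmentation and gloss differ in length")
    return list(zip(morphemes, labels))

for morpheme, label in pair_gloss("hat-ás-a-i-nak",
                                  "stem-DERIV.SUFF-POSS-POSS.PL-DAT"):
    print(f"{morpheme:>4}  {label}")
```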

7
Case suffixes in Hungarian
  • 20 cases (rare cases are not always counted:
    distributive-temporal (-nte), associative
    (-stul/-stül))
  • Always at the right end of the word form
  • Grammatical relations are encoded:
      • Arguments of the verb
      • Adjuncts (temporal and locative adverbials)

8
and in English
  • Pisti szerdánként edzésre jár.
    Steve Wednesday-DIST-TEMP training-SUB go-3Sg
    'Each Wednesday Steve goes to training.'
  • Szerdánként: 'each Wednesday'
  • Edzésre: 'to training'

9
  • Pisti bort iszik.
    Steve wine-ACC drink-3Sg
    'Steve is drinking wine.'
  • Pisti (NOM) 'Steve': subject
  • Bort (ACC) 'wine': object

10
Possessive in Hungarian
  • A fiú kutyája
    the boy dog-POSS
    'the boy's dog'
  • A(z ő) kutyája
    the (he) dog-POSS
    'his dog'
      • Possessor in nominative
      • Possessed with a possessive marker
  • A fiúnak a kutyája
    the boy-DAT the dog-POSS
      • Possessor in dative
      • Possessed with a possessive marker

11
and in English
  • The boy's dog
  • His dog
  • Possessor with a possessive marker (pronoun)
  • Possessed with no marker
  • The dog of the boy
  • Possessive relation is marked by a preposition

12
Hungarian vs. English - nouns
  • Number of word forms: several hundred (HU) vs.
    2-3 (EN)
  • Means to express grammatical relations:
      • Suffixes (HU)
      • Preposition, fixed position (word order), suffix,
        determiner (EN)
  • Methods for morphological parsing are very
    different for Hungarian and English

13
Verbal suffixes
  • A stem can be extended by:
      • Derivational suffixes
      • Mood markers
      • Tense markers
      • Person/number suffixes
      • Objective markers
  • Vág-at-ná-k
    cut-CAUS-COND-3PlObj
    'they would have it cut'

14
Mood and tense in Hungarian
  • Mood
      • Indicative: default (not marked)
      • Conditional: suffixes (present), analytic form
        (past)
      • Imperative: suffixes
  • Tense
      • Present: default (not marked)
      • Past: suffixes
      • Future: analytic (auxiliary fog)

15
and in English
  • Mood
      • Indicative: default (not marked)
      • Conditional: past tense forms, analytic forms
        (auxiliary would)
      • Imperative: auxiliaries, grammatical structure
  • Tense
      • Present: default (not marked)
      • Past: suffix / irregular forms (suppletives or
        ablaut (vowel change))
      • Future: analytic (auxiliary will)

16
Person and number
  • Hungarian: suffixes
      • Fut-ok
      • Fut-sz
      • Fut
      • Fut-unk
      • Fut-tok
      • Fut-nak
  • 3Sg is the default (not marked!)
  • English: 3Sg suffix, pronouns / obligatory subject
      • I run
      • You run
      • He runs
      • We run
      • You run
      • They run
  • 3Sg marked!

17
Derivational suffixes in Hungarian
  • Possibility/permission: fut-hat-ok
    run-MOD-1Sg
    'I may run'
  • Reflexive: mos-akod-unk
    wash-REFL-1Pl
    'we wash ourselves'
  • Frequentative: üt-öget-sz
    hit-FREQ-2Sg
    'you hit sg repeatedly'
  • Causative: csinál-tat-nak
    do-CAUS-3Pl
    'they have sg done'

18
and in English
  • Possibility/permission: auxiliaries
  • Reflexive: pronominal objects
  • Frequentative: adverb
  • Causative: construction

19
Hungarian vs. English - verbs
  • Number of word forms: several hundred (HU) vs.
    4-5 (EN)
  • Means to express grammatical relations:
      • Suffixes, auxiliaries (HU)
      • Auxiliaries, reflexive pronouns, constructions
        (EN)
  • A lot of syntactic information is encoded in
    Hungarian morphemes

20
Morphology     | Syntax                 | English
Nominal suffix | verb-argument relation | word order, preposition
               | possessive             | suffix, preposition
Verbal suffix  | tense                  | suffix
               | agreement              | pronoun, suffix
               | modality               | auxiliary
               | causation              | construction
               | aspect                 | construction
               | reflexivity            | pronoun
21
Morphosyntactic coding systems
  • Language independent (?)
  • Language dependent
  • (dis)advantages
  • comparability
  • considering language-specific features
  • complexity
  • Different information is necessary for each
    language

22
Hungarian coding systems
  • HUMOR
  • recall Thursday Session 1 ?
  • in the Hungarian National Corpus
  • MSD
  • In Szeged Treebank
  • Parser and POS-tagger available at
    http://www.inf.u-szeged.hu/rgai/magyarlanc
  • KR
  • No database
  • Parser and POS-tagger available at
    http://mokk.bme.hu/resources/hunmorph/index_html
  • http://code.google.com/p/hunpos/

23
MSD
  • Morphosyntactic Description
  • International coding system
  • English
  • Romanian
  • Slovenian
  • Czech
  • Bulgarian
  • Estonian
  • Hungarian

24
MSD - 2
  • Positional codes
  • A given position encodes a given type of
    information
  • Position 0: part-of-speech
  • Position 1: (sub)type within POS
  • Further positions: other grammatical information
    (person, number, case, etc.)
  • Irrelevant positions are marked with a hyphen (-)
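
A positional code of this kind can be decoded with a simple lookup per position. The sketch below handles verb codes only; the attribute names and value tables are assumptions inferred from the example codes shown later in this talk (e.g. Vmip1s---n, Vois2p---y), not the official MSD specification, and decode_verb_msd is a hypothetical helper.

```python
# Layout for verb codes, inferred from the talk's examples (an assumption).
VERB_POSITIONS = [
    ("pos", {"V": "verb"}),
    ("type", {"m": "main", "o": "modal (-hat/-het)",
              "f": "frequentative", "s": "causative"}),
    ("mood", {"i": "indicative"}),
    ("tense", {"p": "present", "s": "past"}),
    ("person", {"1": "1", "2": "2", "3": "3"}),
    ("number", {"s": "singular", "p": "plural"}),
    (None, {}), (None, {}), (None, {}),  # positions unused in these examples
    ("definite_object", {"y": "yes", "n": "no"}),
]

def decode_verb_msd(code):
    """Turn a positional verb code into an attribute -> value dictionary."""
    features = {}
    for char, (name, values) in zip(code, VERB_POSITIONS):
        if char == "-" or name is None:
            continue  # hyphen marks an irrelevant position
        features[name] = values.get(char, char)  # unknown values kept as-is
    return features

print(decode_verb_msd("Vmip1s---n"))
print(decode_verb_msd("Vois2p---y"))
```

Keeping unknown characters as-is rather than raising an error is a design choice for the sketch: it lets the decoder degrade gracefully on values it has no table entry for.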

25
KR
  • Created for Hungarian
  • Hierarchical attribute-value matrices
  • Default values (3Sg, singular)
  • Derivational information is encoded
  • Compounds are also segmented

26
MSD vs. KR
  • Differences between the two systems
  • derivation
  • compounds
  • Harmonization efforts in order to build a
    morphological parser the output of which is in
    total harmony with the Szeged Treebank
    (magyarlanc) (Farkas et al. 2010)

27
Nouns in MSD
Word form      | Lemma    | MSD code   | Gloss
kutya          | kutya    | Nc-sn      | dog
kutyámat       | kutya    | Nc-sa---s1 | my dog-ACC
kutyaházaikról | kutyaház | Nc-ph---p3 | about their doghouses
Obamához       | Obama    | Np-st      | to Obama
28
Verbs in MSD
Word form     | Lemma  | MSD code   | Gloss
futok         | fut    | Vmip1s---n | I run
futhatsz      | fut    | Voip2s---n | you can run
ütögették     | üt     | Vfis3p---y | they were hitting it
csináltattunk | csinál | Vsis1p---n | we had sg made
29
Morphosyntactically annotated Hungarian corpora
  • Hungarian National Corpus
  • 100-million-word balanced reference corpus of
    present-day Hungarian
  • Word forms automatically annotated for stem, part
    of speech and inflectional information
  • http://corpus.nytud.hu/mnsz/index_eng.html
  • Szeged Treebank
  • 1 million words, 82K sentences
  • Manually annotated for lemma, POS-tags
  • Constituency and dependency trees
  • http://www.inf.u-szeged.hu/rgai/nlp

30
Szeged Treebank
  • Manually annotated treebank for Hungarian
  • Covers various linguistic styles:
    literature, newspapers, laws, student essays,
    computer books, etc.
  • Multilingual connection: Orwell's 1984, Win2000
    manual in Hungarian
  • Available free of charge for research
  • Developed by
  • University of Szeged, HLT group
  • MorphoLogic Ltd.
  • Academy of Sciences, Research Institute for
    Linguistics

31
Szeged Treebank - 2
  • TEI XML format
  • Manually annotated
  • sentence splitting, word segmentation
  • morphological analysis
  • PTB-style syntactic structure
  • Verb argument structure
  • converted / extended to Dependency Grammar format
    manually

32
Szeged Treebank - 3
  • Several versions
  • Constituency and dependency versions
  • Old MSD codes
  • New (harmonized) MSD codes
  • (dependency) parser under development
  • Being extended with folklore texts

33
Dependency vs. constituency
  • Each node corresponds to a word -> no virtual
    nodes (CP, I) in dependency trees
  • Constituency grammars are said to be good for
    languages with fixed word order
  • Syntactic relations are determined
      • by the position in the tree (constituency
        grammar)
      • by dependency relations (labeled edges)
        (dependency)

34
Constituency trees in SzT2.0
  • Based on generative syntax (É. Kiss et al. 1999)
  • Syntactic features of Hungarian also considered
    (i.e. not hardcore Chomskyan trees)
  • Verb-argument relations are encoded by labels
  • Very detailed information: different grammatical
    role for each case suffix
  • Semantic information can also be found (temporal
    and locative adverbials)

35
Gloss: Aggie all relative-POSS-ACC the day before
yesterday see-PAST-3Sg-Obj guest-ESS
Translation: Aggie received all of her relatives the
day before yesterday.
36
(No Transcript)
37
Dependency trees in Szeged Dependency Treebank
  • Based on SzT2.0
  • Automatic conversion and manual correction
  • Word forms are the nodes of the tree
  • Simplified relations for nominal arguments: SUBJ,
    OBJ, DAT, OBL, ATT
  • Semantic information kept
  • Sentences without 3Sg copula are distinctively
    marked

38
Winston Smith, his chin nuzzled into his breast
in an effort to escape the vile wind, slipped
quickly through the glass doors of Victory
Mansions.
39
Virtual nodes
  • No overt copula in present tense 3Sg
  • Only subject and predicative noun/adjective
    manifest
  • No syntactic structure in SzT (grammatical roles
    are not marked)
  • Virtual nodes in SzDT

40
I like to go to school because it is good to be
at school though not always.
41
Szeged Treebank vs. Szeged Dependency Treebank
  • Labeled relations in both cases -> not so sharp
    contrast
  • Virtual nodes in SzDT -> grammatical structure
    marked for every sentence (IE, MT)
  • No word order constraints in SzDT
  • Word forms are marked
  • Other possibilities: morpheme-based syntax
    (Prószéky et al. (1989), Koutny, Wacha (1991))

42
Language-specific morphosyntactic problems
  • Morphology vs. syntax
  • Pseudo-subjects
  • Pseudo-objects
  • Pseudo-datives
  • Morphological analysis of unknown words
  • Lemmatization of named entities

43
Pseudo-subjects
  • A noun in nominative is not necessarily the subject
    of the sentence -> special attention required when
    parsing
  • Possessor: a kisfiú labdája
    the boy ball-3SgPOSS
    'the boy's ball'
  • Predicative noun: István juhász maradt.
    Stephen shepherd remain-PAST
    'Stephen remained a shepherd.'
  • Object: A kutyám kergeti a macska.
    the dog-1SgPOSS chase-3SgObj the cat
    'The cat is chasing my dog.' (garden path
    sentence)
  • A fiam szereti a lányod.
    the son-1SgPOSS love-3SgObj the daughter-2SgPOSS
    'My son loves your daughter' or 'Your daughter
    loves my son'

44
Solutions
  • Possessor
      • SzT: one NP includes the possessor and the
        possessed ((a kisfiú) labdája)
      • SzDT: ATT relation
  • Predicative noun: PRED relation
      • Virtual node in SzDT
  • Object: OBJ relation
  • Sometimes contextual information is needed even
    for humans

45
Pseudo-objects
  • Adverbials with an apparently accusative ending
  • Futottam egy jót.
  • Run-PAST-1Sg a good-ACC
  • I have had a good run.
  • Nagyot aludtam.
  • Big-ACC sleep-PAST-1Sg
  • I have slept a lot.
  • Intransitive verbs -> cannot be an object -> MODE
    relation

46
Pseudo-datives
  • Not all (semantic) subjects are in nominative
  • Dative subject
  • Sándornak kell elrendeznie az ügyeket.
  • Alexander-DAT must arrange-INF-3Sg the issue-PL-ACC
  • Alexander has to arrange the issues.
  • DAT in both corpora
  • Certain auxiliaries with dative subjects
    (exceptions)
  • Dative-nominative parallelism in possessive as
    well

47
Unknown words
  • Unknown words can be
      • Compounds
      • Named entities
      • Derivations
  • Examples: fémkapunk, félmillió, csokinyúl, NATO-hoz
  • Methods for analysis (Zsibrita et al. 2010)
      • Segmentation into two or more analyzable parts
      • Expert rules to filter impossible combinations
        (V+N)
      • Analysis of the last part goes to the whole word
      • Substitution for hyphenated words (pre-defined
        patterns for each morphological class)
48
félmillió
  • félmillió
  • Mc-snl

Part   | POS | Gloss
fél    | N   | half
fél    | ADJ | half
fél    | NUM | half
fél    | V   | be afraid
millió | NUM | million
Expert rules: NUM + NUM; non-NUM + NUM
49
fémkapunk
  • fémkapunk
  • Vmip1p---n
  • fémkapunk
  • Nc-sn---p1

Part | POS | Gloss
fém  | N   | metal
kap  | V   | get
kapu | N   | gate
unk  | S   | 1Pl (verb)
nk   | S   | 1PlPoss (noun)
Expert rules: N + N; N-nonNOM + V; N-NOM + V
50
csokinyúl
  • csokinyúl
  • Vmip3s---n
  • Nc-sn
  • csokinyúl (?)
  • Vmip3s---n

Part   | POS | Gloss
csoki  | N   | chocolate
nyúl   | N   | rabbit
nyúl   | V   | stretch
kinyúl | V   | stretch out
Expert rules: N + N; N-nonNOM + V; N-NOM + V
51
NATO-hoz
  • NATO-hoz
  • NATO V
  • Vmip3s---n
  • NATO-hoz (kalaphoz)
  • NATO N
  • Np-st
  • Ordering of rules
  • substitution
  • segmentation

Part | POS | Gloss
NATO | ?   | NATO
hoz  | V   | bring
hoz  | S   | to
Expert rules: N - S; N-nonNOM - V; N-NOM - V; V - V
Substitution: NATO- -> kalap 'hat'
52
Lemmatization
  • Lemmatization (i.e. dividing the word form into
    its root and affixes) is not a trivial task in
    morphologically rich languages such as Hungarian
  • Common nouns: rely on a good dictionary
  • NEs cannot be listed
  • Problem: the NE ends in an apparent suffix

53
Lemmatization of NEs
  • Each ending that seems to be a possible suffix is
    cut off the NE in a step-by-step fashion
  • Citroenben
      • Citroenben (lemma)
      • Citroen + ben: 'in (a) Citroen'
      • Citroenb + en: 'on (a) Citroenb'
      • Citroenbe + n: 'on (a) Citroenbe'
  • Each possible lemma undergoes a Google and a
    Yahoo search; the most frequent one is chosen
    (Farkas et al. 2008)
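
The candidate generation step can be sketched like this. The suffix list comes from the mini dictionary shown later in this talk; the function names are hypothetical, and `frequency` is a stand-in callable for the web search hit counts, which this sketch does not query. Only one ending is cut per candidate, which is a simplification of the step-by-step procedure.

```python
# Endings that look like suffixes (from the mini dictionary in this talk).
SUFFIXES = ["ben", "en", "n", "ot", "t"]

def lemma_candidates(word, suffixes=SUFFIXES):
    """Return the word itself plus every stem left after cutting one ending."""
    candidates = [word]
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            candidates.append(word[: -len(suffix)])
    return candidates

def choose_lemma(word, frequency):
    """Pick the candidate with the highest count.

    `frequency` is a hypothetical callable mapping a candidate to a number.
    """
    return max(lemma_candidates(word), key=frequency)

print(lemma_candidates("Citroenben"))
print(lemma_candidates("Peugeot"))
```

Passing the frequency source in as a function keeps the candidate logic testable without any network access.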

54
NLP applications
  • NER
  • NEs with suffixes
  • Information extraction
  • Modality, uncertainty
  • Causation
  • Machine translation
  • Morphemes vs. structures

55
Named Entities
  • NEs should be recognized
  • They should be morphosyntactically tagged ->
    proper syntactic/semantic analysis
  • A Citroenben a Peugeot meghatározó
    tulajdonhányadot szerez.
  • Mini dictionary, suffix list, semantic frame

56
Form         | POS | Gloss
a            | DET | the
ben          | S   | in
Citroenben   | ?   |
en           | S   | on
meghatározó  | ADJ | dominant
n            | S   | on
ot           | S   | ACC
Peugeot      | ?   |
szerez       | V   | acquire
t            | S   | ACC
tulajdonrész | N   | interest
57
Possible analyses
  • Citroenben
      • Citroenben
      • Citroen + ben: Citroen-INE
      • Citroenb + en: Citroenb-SUP
      • Citroenbe + n: Citroenbe-SUP
  • Peugeot
      • Peugeot
      • Peugeo + t: Peugeo-ACC
      • Peuge + ot: Peuge-ACC

58
A semantic frame
  • <event frametransaction.ownerchange>1V("szerez"
    "vásárol""vesz""megvesz""megvásárol""felvásárol")subject2direct_object3
  • <rv rolebuyer>2N</rv>
  • 3N("részesedés""tulajdon""tulajdonrész"
    "rész
  • tulajdonhányad)compl14modified_by_adj5
  • <rv roleproduct>4Ncaseineceg</rv>
  • <rv rolenewshare>5Ameasuremodified_by_number6
  • 6NB</rv>
  • </event>

59
Analysis
  • A Citroenben a Peugeot meghatározó
    tulajdonhányadot szerez.
  • Tulajdonhányadot -> ACC/OBJ (3)
  • Citroenben -> INE (4)
  • Peugeot -> NOM/SUBJ (2)
  • Peugeot acquires a dominant interest in
    Citroen.

60
Uncertainty
  • Text Mining
  • derive facts from free text
  • uncertainty and negation have an impact on the
    quality/nature of the information extracted
  • applications have to treat sentences / clauses
    containing uncertain or negated information
    differently from factual information
  • Uncertainty: possible existence of a thing
    (neither its existence nor its non-existence is
    claimed)

61
Uncertainty detection
  • Uncertainty detection in English: cues (words
    with uncertain content)
  • One typical means to express uncertainty in
    Hungarian: -hat/-het
  • High school grades may influence health.
  • A középiskolai jegyek kihathatnak az egészségre.
  • Morphological analysis should reflect modality
    (Voip3s---n)
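
If the morphological code marks the -hat/-het verbs with their own subtype, uncertainty candidates can be found by inspecting the tag alone. The sketch below assumes that position 1 of a verb code is "o" for such verbs, as in the codes shown in this talk; the function names are hypothetical, the non-verb tags are placeholders, and the 3Pl code for kihathatnak is constructed by analogy with the other examples.

```python
def is_modal_verb(msd):
    """True if the code marks a verb with the -hat/-het (modal) subtype."""
    return len(msd) >= 2 and msd[0] == "V" and msd[1] == "o"

def uncertain_tokens(tagged_sentence):
    """Return the word forms whose code marks possibility/permission."""
    return [form for form, msd in tagged_sentence if is_modal_verb(msd)]

# "?" tags are placeholders, not real MSD codes; the verb code is an
# assumption built on the pattern of the talk's other verb codes.
sentence = [
    ("A", "?"), ("középiskolai", "?"), ("jegyek", "?"),
    ("kihathatnak", "Voip3p---n"), ("az", "?"), ("egészségre", "?"),
]
print(uncertain_tokens(sentence))
```

A tag-based check of this kind only finds morphologically marked modality; lexical cues would still need a cue list, as in English.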

62
Causation
  • Semantic/thematic relations to be determined
    properly
  • AGENT ≠ SUBJECT
  • Varrattam egy ruhát.
  • sew-CAUS-PAST-1Sg a dress-ACC
  • I had a dress sewn.
  • Varrattam Marival egy ruhát.
  • sew-CAUS-PAST-1Sg Mari-INS a dress-ACC
  • I had Mary sew a dress.
  • Varrtam Marival egy ruhát.
  • sew-PAST-1Sg Mari-INS a dress-ACC
  • I sewed a dress with Mary.
  • Causative information should be encoded
    (Vsip3s---n)

63
Argument structure of causative verbs
Sentence                     | Agent               | Beneficiary | Patient
Varrattam egy ruhát.         | ?                   | I (NOM)     | ruha (ACC)
Varrattam Marival egy ruhát. | Mari (INS)          | I (NOM)     | ruha (ACC)
Varrtam Marival egy ruhát.   | I (NOM), Mari (INS) | ?           | ruha (ACC)
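
The table suggests that the mapping from case to thematic role depends on whether the verb is causative. A minimal sketch of that idea follows. It assumes the causative subtype is "s" at position 1 of the verb code, reads the last row as "I" and "Mari" jointly being agents, and uses verb codes constructed by analogy with the talk's examples; thematic_roles is a hypothetical name.

```python
# Case -> thematic role, following the table on this slide.
CAUSATIVE_ROLES = {"NOM": "beneficiary", "INS": "agent", "ACC": "patient"}
PLAIN_ROLES = {"NOM": "agent", "INS": "agent", "ACC": "patient"}

def thematic_roles(verb_msd, arguments):
    """Map (lemma, case) arguments to roles depending on causativity."""
    table = CAUSATIVE_ROLES if verb_msd[1] == "s" else PLAIN_ROLES
    roles = {}
    for lemma, case in arguments:
        roles.setdefault(table[case], []).append(lemma)
    return roles

args = [("I", "NOM"), ("Mari", "INS"), ("ruha", "ACC")]
print(thematic_roles("Vsis1s---n", args))  # Varrattam Marival egy ruhát.
print(thematic_roles("Vmis1s---n", args))  # Varrtam Marival egy ruhát.
```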
64
Machine translation
  • Morpheme-based translation would be ideal
  • Easier alignment of translational units
  • Good morphological parser needed
  • Easier to execute in dependency grammar
  • Morpheme-based dependency structures

65
Alignments
  • Hungarian morphemes: at, varr, t, ruha
  • English words: have, sewn, dress
  • ban / ház / am <-> in / house / my
66
Problems
  • Not practical: no corpus available at the moment
  • Portmanteau morphs: alignment problems
  • Zero morphs: how many of them?
  • 3 zero morphs in Hungarian nouns
  • könyv-Ø-Ø-Ø vs. könyveit
    book-Ø-Ø-Ø vs. book-POSS-POSS.PL-ACC
  • (Mel'čuk 2006)

67
  • Morphosyntactic codes might help
  • Csinálhattátok Vois2p---y
  • Reordering rules

Code | Morpheme | Gloss
V    | csinál   | do
o    | hat      | can
i    | -        | -
s    | t        | PAST
2p   | tok      | you
y    | á        | it
csinálhattátok: 'you could do it'
68
An example
  • [Figure: morpheme-based dependency trees.
    Hungarian nodes: hat, csinál, t, á, tok.
    English nodes: can, do, d, Ø, you,
    giving: could, you, do.]
69
Syntax vs. case suffix
Problem                              | Approach
Pseudo-subject                       | Extra rules (PRED, OBJ); difficult for humans
Pseudo-object                        | List of adverbs with accusative ending
Pseudo-dative                        | List of verbs with dative subject
Unknown words (lemmas + suffixes)    | Guessing (rules)
Information extraction               |
  Thematic/semantic relations        | Proper morphosyntactic codes, rules
  Uncertainty detection              | Proper morphosyntactic codes
Machine translation (morpheme-based) | Proper morphosyntactic codes
70
Summary
  • Syntax-morphology interface in Hungarian
  • Morphological coding systems
  • Syntactic annotation in Hungarian corpora
  • Morphosyntactic problems
  • NER
  • IE
  • MT

71
References
  • É. Kiss K., Kiefer F., Siptár P. 1999. Új magyar
    nyelvtan. Osiris Kiadó, Budapest.
  • Farkas Richárd, Szeredi Dániel, Varga Dániel,
    Vincze Veronika 2010. MSD-KR harmonizáció a
    Szeged Treebank 2.5-ben. In: Tanács Attila,
    Vincze Veronika (eds.): VII. Magyar
    Számítógépes Nyelvészeti Konferencia. Szeged,
    Szegedi Tudományegyetem, pp. 349-353.
  • Farkas, Richárd; Vincze, Veronika; Nagy, István;
    Ormándi, Róbert; Szarvas, György; Almási, Attila
    2008. Web-based lemmatisation of Named Entities.
    In: Horák, Aleš; Kopeček, Ivan; Pala, Karel;
    Sojka, Petr (eds.): Proceedings of the 11th
    International Conference on Text, Speech and
    Dialogue (TSD2008). Berlin, Heidelberg, Springer
    Verlag, LNCS 5246, pp. 53-60.
  • Koutny I., Wacha B. 1991. Magyar nyelvtan függőségi
    alapon. Magyar Nyelv Vol. 87 No. 4, pp. 393-404.
  • Mel'čuk, Igor 2006. Aspects of the Theory of
    Morphology. Mouton de Gruyter.
  • Prószéky, G., Koutny, I., Wacha, B. 1989. Dependency
    Syntax of Hungarian. In: Maxwell, Dan; Klaus
    Schubert (eds.): Metataxis in Practice (Dependency
    Syntax for Multilingual Machine Translation).
    Foris, Dordrecht, The Netherlands, pp. 151-181.
  • Zsibrita János, Vincze Veronika, Farkas Richárd
    2010. Ismeretlen kifejezések és a szófaji
    egyértelműsítés. In: Tanács Attila, Vincze
    Veronika (eds.): VII. Magyar Számítógépes
    Nyelvészeti Konferencia. Szeged, Szegedi
    Tudományegyetem, pp. 275-283.