Title: The Syntax-Morphology Interface and Natural Language Processing
1 The Syntax-Morphology Interface and Natural Language Processing
- Veronika Vincze
- University of Szeged
- Hungary
- vinczev_at_inf.u-szeged.hu
Thematic Training Course on Processing Morphologically Rich Languages, 11-15 April 2011
2 Outline
- Introduction
- Syntax vs. morphology from a linguistic viewpoint
- Morphological coding systems in Hungarian
- Morphosyntactic information in Hungarian corpora
- Language-specific morphosyntactic problems
- Effects on IE, NER and MT
3 Syntax vs. morphology
- Typological differences among languages
- Agglutinative languages: the role of morphology is stronger (a lot of information in morphemes)
- Isolating languages: the role of syntax is stronger (fewer morphemes, more constructions)
- Focus on Hungarian (agglutinative) and English (fusional/isolating)
4 Basic Hungarian syntax
- A lot of information is encoded in morphemes
- No fixed word order
- Information structure is reflected in word order (theme-rheme, old-new)
- Péter szereti Marit.
- Peter love-3SgObj Mary-ACC
- Peter loves Mary.
- Péter Marit szereti: It is Mary who Peter loves.
- Marit szereti Péter: It is Mary who Peter loves.
- Marit Péter szereti: It is Peter who loves Mary.
- Szereti Péter Marit: Peter LOVES Mary (rather than hating her).
- Szereti Marit Péter: Peter LOVES Mary (rather than hating her).
5 Morphosyntactic features of Hungarian
- Nominal declension (nouns, adjectives, numerals)
- Verbal conjugation
- Several hundred word forms for each lemma
- Grammatical relations are encoded primarily by morphemes -> morphosyntax
6 Nominal suffixes
- A stem can be extended by
- Derivational suffixes
- Plural
- Possessive
- Case suffixes
- hat-ás-a-i-nak: to its effects
- stem-DERIV.SUFF-POSS-POSS.PL-DAT
- egész-ség-ed-re: cheers (literally: to your health)
- stem-DERIV.SUFF-POSS.Sg2-SUB
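The two segmented examples pair each morpheme with one gloss label. A minimal sketch of that pairing follows; the function name `align_gloss` is hypothetical and only illustrates how a hyphen-segmented word form lines up with its gloss, it is not part of any tool mentioned in the course.

```python
def align_gloss(segmented: str, gloss: str) -> list[tuple[str, str]]:
    """Pair each hyphen-separated morpheme with its gloss label."""
    morphemes = segmented.split("-")
    labels = gloss.split("-")
    if len(morphemes) != len(labels):
        raise ValueError(
            f"{len(morphemes)} morphemes but {len(labels)} gloss labels"
        )
    return list(zip(morphemes, labels))


examples = [
    ("hat-ás-a-i-nak", "stem-DERIV.SUFF-POSS-POSS.PL-DAT"),
    ("egész-ség-ed-re", "stem-DERIV.SUFF-POSS.Sg2-SUB"),
]
for word, gloss in examples:
    for morpheme, label in align_gloss(word, gloss):
        print(f"{morpheme}\t{label}")
    print()
```

The length check matters in practice: a mismatch between morphemes and labels usually signals a segmentation error rather than a glossing choice.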
7 Case suffixes in Hungarian
- 20 cases (rare cases are not always counted: distributive-temporal (-nte), associative (-stul/-stül))
- always at the right end of the word form
- grammatical relations are encoded
- Arguments of the verb
- Adjuncts (temporal and locative adverbials)
8 and in English
- Pisti szerdánként edzésre jár.
- Steve Wednesday-DIST-TEMP training-SUB go-3Sg
- Each Wednesday Steve goes to training.
- szerdánként: each Wednesday
- edzésre: to training
9 (continued)
- Pisti bort iszik.
- Steve wine-ACC drink-3Sg
- Steve is drinking wine.
- Pisti (NOM): Steve, subject
- bort (ACC): wine, object
10 Possessive in Hungarian
- A fiú kutyája
- The boy dog-POSS
- The boy's dog
- A(z ő) kutyája
- The (he) dog-POSS
- His dog
- Possessor in nominative
- Possessed with a possessive marker
- A fiúnak a kutyája
- The boy-DAT the dog-POSS
- Possessor in dative
- Possessed with a possessive marker
11 and in English
- The boy's dog
- His dog
- Possessor with a possessive marker (pronoun)
- Possessed with no marker
- The dog of the boy
- Possessive relation is marked by a preposition
12 Hungarian vs. English - nouns
- Number of word forms: several hundred (HU) vs. 2-3 (EN)
- Means to express grammatical relations
- Suffixes (HU)
- Preposition, fixed position (word order), suffix, determiner (EN)
- Methods for morphological parsing are very different for Hungarian and English
13 Verbal suffixes
- A stem can be extended by
- Derivational suffixes
- Mood markers
- Tense markers
- Person/number suffixes
- Objective markers
- Vág-at-ná-k
- Cut-CAUS-COND-3PlObj
- they would have it cut
14 Mood and tense in Hungarian
- Mood
- Indicative: default (not marked)
- Conditional: suffixes (present), analytic form (past)
- Imperative: suffixes
- Tense
- Present: default (not marked)
- Past: suffixes
- Future: analytic (auxiliary fog)
15 and in English
- Mood
- Indicative: default (not marked)
- Conditional: past tense forms, analytic forms (auxiliary would)
- Imperative: auxiliaries, grammatical structure
- Tense
- Present: default (not marked)
- Past: suffix / irregular forms (suppletives or ablaut (vowel change))
- Future: analytic (auxiliary will)
16 Person and number
- Hungarian: suffixes
- Fut-ok
- Fut-sz
- Fut
- Fut-unk
- Fut-tok
- Fut-nak
- 3Sg is the default (not marked!)
- English: 3Sg suffix, pronouns / obligatory subject
- I run
- You run
- He runs
- We run
- You run
- They run
- 3Sg marked!
17 Derivational suffixes in Hungarian
- Possibility/permission
- fut-hat-ok
- run-MOD-1Sg
- I may run
- Reflexive
- mos-akod-unk
- wash-REFL-1Pl
- we wash ourselves
- Frequentative
- üt-öget-sz
- hit-FREQ-2Sg
- you hit something repeatedly
- Causative
- csinál-tat-nak
- do-CAUS-3Pl
- they have something done
18 and in English
- Possibility/permission: auxiliaries
- Reflexive: pronominal objects
- Frequentative: adverb
- Causative: construction
19 Hungarian vs. English - verbs
- Number of word forms: several hundred (HU) vs. 4-5 (EN)
- Means to express grammatical relations
- Suffixes, auxiliaries (HU)
- Auxiliaries, reflexive pronouns, constructions (EN)
- A lot of syntactic information is encoded in Hungarian morphemes
20 Morphology | Syntax | English
Nominal suffix | verb-argument relation | word order, preposition
Nominal suffix | possessive | suffix, preposition
Verbal suffix | tense | suffix
Verbal suffix | agreement | pronoun, suffix
Verbal suffix | modality | auxiliary
Verbal suffix | causation | construction
Verbal suffix | aspect | construction
Verbal suffix | reflexivity | pronoun
21 Morphosyntactic coding systems
- Language independent (?)
- Language dependent
- (dis)advantages
- comparability
- considering language-specific features
- complexity
- Different information is necessary for each language
22 Hungarian coding systems
- HUMOR
- recall Thursday Session 1 ?
- in the Hungarian National Corpus
- MSD
- In Szeged Treebank
- Parser and POS-tagger available at http://www.inf.u-szeged.hu/rgai/magyarlanc
- KR
- No database
- Parser and POS-tagger available at http://mokk.bme.hu/resources/hunmorph/index_html
- http://code.google.com/p/hunpos/
23 MSD
- Morphosyntactic Description
- International coding system
- English
- Romanian
- Slovenian
- Czech
- Bulgarian
- Estonian
- Hungarian
24 MSD - 2
- Positional codes
- A given position encodes a given type of information
- Position 0: part-of-speech
- Position 1: (sub)type within POS
- Further positions: other grammatical information (person, number, case, etc.)
- Irrelevant positions are marked with a hyphen (-)
25 KR
- Created for Hungarian
- Hierarchical attribute-value matrices
- Default values (3Sg, singular)
- Derivational information is encoded
- Compounds are also segmented
26 MSD vs. KR
- Differences between the two systems
- derivation
- compounds
- Harmonization efforts in order to build a morphological parser whose output is in total harmony with the Szeged Treebank (magyarlanc) (Farkas et al. 2010)
27 Nouns in MSD
Word form | Lemma | MSD code | Translation
kutya | kutya | Nc-sn | dog
kutyámat | kutya | Nc-sa---s1 | my dog-ACC
kutyaházaikról | kutyaház | Nc-ph---p3 | about their doghouses
Obamához | Obama | Np-st | to Obama
28 Verbs in MSD
Word form | Lemma | MSD code | Translation
futok | fut | Vmip1s---n | I run
futhatsz | fut | Voip2s---n | you can run
ütögették | üt | Vfis3p---y | they were hitting it
csináltattunk | csinál | Vsis1p---n | we had something made
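The positional principle can be illustrated with a small decoder for the four verb codes above. This is a sketch only: the position-to-attribute mapping and the value labels in `VERB_POSITIONS` and `VERB_VALUES` are assumptions inferred from the example codes on these slides, not taken from the official MSD specification, and the function name `decode_verb_msd` is hypothetical.

```python
# ASSUMPTION: position -> attribute name, inferred from the slide examples
# (e.g. Voip2s---n = verb, modal, indicative, present, 2nd person, singular).
VERB_POSITIONS = {
    1: "type", 2: "mood", 3: "tense", 4: "person", 5: "number", 9: "definiteness",
}

# ASSUMPTION: value labels, again read off the examples only.
VERB_VALUES = {
    "type": {"m": "main", "o": "modal", "f": "frequentative", "s": "causative"},
    "mood": {"i": "indicative"},
    "tense": {"p": "present", "s": "past"},
    "number": {"s": "singular", "p": "plural"},
    "definiteness": {"y": "definite object", "n": "no definite object"},
}


def decode_verb_msd(code: str) -> dict[str, str]:
    """Turn a positional verb code into attribute-value pairs."""
    if not code.startswith("V"):
        raise ValueError(f"not a verb code: {code!r}")
    features = {"pos": "verb"}
    for position, name in VERB_POSITIONS.items():
        # A hyphen (or a code that is too short) means "not relevant".
        if position >= len(code) or code[position] == "-":
            continue
        symbol = code[position]
        # Unknown symbols are passed through unchanged.
        features[name] = VERB_VALUES.get(name, {}).get(symbol, symbol)
    return features


for code in ["Vmip1s---n", "Voip2s---n", "Vfis3p---y", "Vsis1p---n"]:
    print(code, decode_verb_msd(code))
```

Because the code is positional, the decoder never needs to parse the string; it only indexes into it, which is one practical advantage of MSD-style tags over free-form labels.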
29 Morphosyntactically annotated Hungarian corpora
- Hungarian National Corpus
- 100-million-word balanced reference corpus of present-day Hungarian
- Word forms automatically annotated for stem, part of speech and inflectional information
- http://corpus.nytud.hu/mnsz/index_eng.html
- Szeged Treebank
- 1 million words, 82K sentences
- Manually annotated for lemma, POS-tags
- Constituency and dependency trees
- http://www.inf.u-szeged.hu/rgai/nlp
30 Szeged Treebank
- Manually annotated treebank for Hungarian
- Covers various linguistic styles
- literature, newspapers, laws, student essays, computer books, etc.
- multilingual connection: Orwell's 1984, Win2000 manual in Hungarian
- Available free of charge for research
- Developed by
- University of Szeged, HLT group
- MorphoLogic Ltd.
- Academy of Sciences, Research Institute for Linguistics
31 Szeged Treebank 2.
- TEI XML format
- Manually annotated
- sentence splitting, word segmentation
- morphological analysis
- PTB-style syntactic structure
- Verb argument structure
- converted / extended to Dependency Grammar format manually
32 Szeged Treebank 3.
- Several versions
- Constituency and dependency versions
- Old MSD codes
- New (harmonized) MSD codes
- (dependency) parser under development
- Being extended with folklore texts
33 Dependency vs. constituency
- Each node corresponds to a word -> no virtual nodes (CP, I) in dependency trees
- Constituency grammars are said to be good for languages with fixed word order
- Syntactic relations are determined
- by the position in the tree (constituency grammar)
- by dependency relations (labeled edges) (dependency grammar)
34 Constituency trees in SzT2.0
- Based on generative syntax (É. Kiss et al. 1999)
- Syntactic features of Hungarian also considered (i.e. not hardcore Chomskyan trees)
- Verb-argument relations are encoded by labels
- Very detailed information: different grammatical role for each case suffix
- Semantic information can also be found (temporal and locative adverbials)
35 (example; figure not transcribed)
- Aggie all relative-POSS-ACC the day before yesterday see-PAST-3Sg-Obj guest-ESS
- Aggie received all of her relatives the day before yesterday.
37 Dependency trees in Szeged Dependency Treebank
- Based on SzT2.0
- Automatic conversion and manual correction
- Word forms are the nodes of the tree
- Simplified relations for nominal arguments: SUBJ, OBJ, DAT, OBL, ATT
- Semantic information kept
- Sentences without 3Sg copula are distinctively marked
38 (example; figure not transcribed)
- Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions.
39 Virtual nodes
- No overt copula in present tense 3Sg
- Only subject and predicative noun/adjective manifest
- No syntactic structure in SzT (grammatical roles are not marked)
- Virtual nodes in SzDT
40 (example; figure not transcribed)
- I like to go to school because it is good to be at school, though not always.
41 Szeged Treebank vs. Szeged Dependency Treebank
- Labeled relations in both cases -> not so sharp contrast
- Virtual nodes in SzDT -> grammatical structure marked for every sentence (IE, MT)
- No word order constraints in SzDT
- Word forms are marked
- Other possibilities: morpheme-based syntax (Prószéky et al. (1989), Koutny, Wacha (1991))
42 Language-specific morphosyntactic problems
- Morphology vs. syntax
- Pseudo-subjects
- Pseudo-objects
- Pseudo-datives
- Morphological analysis of unknown words
- Lemmatization of named entities
43 Pseudo-subjects
- a noun in nominative is not the subject of the sentence -> special attention required when parsing
- Possessor: a kisfiú labdája
- the boy ball-3SgPOSS
- the boy's ball
- Predicative noun: István juhász maradt.
- Stephen shepherd remain-PAST
- Stephen remained a shepherd.
- Object: A kutyám kergeti a macska.
- The dog-POSS chase-3SgObj the cat
- The cat is chasing my dog. (garden path sentence)
- A fiam szereti a lányod.
- The son-1SgPOSS love-3SgObj the daughter-2SgPOSS
- My son loves your daughter. / Your daughter loves my son.
44 Solutions
- Possessor
- SzT: one NP includes the possessor and the possessed ((a kisfiú) labdája)
- SzDT: ATT relation
- Predicative noun: PRED relation
- Virtual node in SzDT
- Object: OBJ relation
- Sometimes contextual information is needed even for humans
45 Pseudo-objects
- Adverbials with an apparently accusative ending
- Futottam egy jót.
- Run-PAST-1Sg a good-ACC
- I have had a good run.
- Nagyot aludtam.
- Big-ACC sleep-PAST-1Sg
- I have slept a lot.
- Intransitive verbs -> cannot be an object -> MODE relation
46 Pseudo-datives
- Not all (semantic) subjects are in nominative
- Dative subject
- Sándornak kell elrendeznie az ügyeket.
- Alexander-DAT must arrange-INF-3Sg the issue-PL-ACC
- Alexander has to arrange the issues.
- DAT in both corpora
- Certain auxiliaries with dative subjects (exceptions)
- Dative-nominative parallelism in possessive as well
47 Unknown words
- Unknown words can be
- Compounds
- Named entities
- Derivations
- fémkapunk
- félmillió
- csokinyúl
- NATO-hoz
- Methods for analysis (Zsibrita et al. 2010)
- Segmentation into two or more analyzable parts
- Expert rules to filter impossible combinations (V + N)
- Analysis of the last part goes to the whole word
- Substitution for hyphenated words (pre-defined patterns for each morphological class)
48 félmillió
fél | N | half
fél | ADJ | half
fél | NUM | half
fél | V | be afraid
millió | NUM | million
Expert rules: NUM + NUM, non-NUM + NUM
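The félmillió example can be sketched as a tiny guesser: split the unknown word into two known parts, filter part-of-speech combinations with an expert rule, and pass the analysis of the last part to the whole word. The toy `LEXICON` comes from the table above; the rule in `allowed` is an assumed reading of the expert rules (a NUM last part only combines with a NUM first part), since the accept/reject marking is not recoverable from the slide. The sketch handles two-part segmentations only, and all function names are hypothetical.

```python
# Toy lexicon from the félmillió slide; a real system would query a
# morphological analyser instead of a dictionary.
LEXICON = {
    "fél": {"N", "ADJ", "NUM", "V"},
    "millió": {"NUM"},
}


def allowed(first_pos: str, last_pos: str) -> bool:
    """ASSUMED expert rule: NUM + NUM is kept, non-NUM + NUM is filtered out."""
    if last_pos == "NUM":
        return first_pos == "NUM"
    return True


def guess_pos(word: str) -> set[str]:
    """Guess POS tags for an unknown word from two-part segmentations."""
    guesses = set()
    for cut in range(1, len(word)):
        first, last = word[:cut], word[cut:]
        if first not in LEXICON or last not in LEXICON:
            continue
        for first_pos in LEXICON[first]:
            for last_pos in LEXICON[last]:
                if allowed(first_pos, last_pos):
                    # The analysis of the last part goes to the whole word.
                    guesses.add(last_pos)
    return guesses


print(guess_pos("félmillió"))
```

Under these assumptions the only surviving combination is NUM + NUM, so the whole word is guessed to be a numeral.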
49 fémkapunk
- fémkapunk (fém + kap + unk)
- Vmip1p---n
- fémkapunk (fém + kapu + nk)
- Nc-sn---p1
fém | N | metal
kap | V | get
kapu | N | gate
unk | S | 1Pl (verb)
nk | S | 1PlPoss (noun)
Expert rules: N + N, N-nonNOM + V, N-NOM + V
50 csokinyúl
- csokinyúl
- Vmip3s---n
- Nc-sn
- csokinyúl (?)
- Vmip3s---n
csoki | N | chocolate
nyúl | N | rabbit
nyúl | V | stretch
kinyúl | V | stretch out
Expert rules: N + N, N-nonNOM + V, N-NOM + V
51 NATO-hoz
- NATO-hoz
- NATO V
- Vmip3s---n
- NATO-hoz (kalaphoz)
- NATO N
- Np-st
- Ordering of rules
- substitution
- segmentation
NATO | ? | NATO
hoz | V | bring
hoz | S | to
Expert rules: N - S, N-nonNOM - V, N-NOM - V, V - V
Substitution: NATO- -> kalap (hat)
52 Lemmatization
- Lemmatization (i.e. dividing the word form into its root and affixes) is not a trivial task in morphologically rich languages such as Hungarian
- common nouns: relying on a good dictionary
- NEs cannot be listed
- Problem: the NE ends in an apparent suffix
53 Lemmatization of NEs
- each ending that seems to be a possible suffix is cut off the NE in step-by-step fashion
- Citroenben
- Citroenben (lemma)
- Citroen + ben: in (a) Citroen
- Citroenb + en: on (a) Citroenb
- Citroenbe + n: on (a) Citroenbe
- Each possible lemma undergoes a Google and a Yahoo search; the most frequent one is chosen (Farkas et al. 2008)
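The candidate generation and frequency-based choice can be sketched as follows. This is not the implementation of Farkas et al. (2008): the suffix list covers only the three endings from the Citroenben example, the web search is replaced by a caller-supplied `hit_count` function, and the numbers in `FAKE_COUNTS` are invented purely for illustration.

```python
from typing import Callable

# The three apparent suffixes from the Citroenben example.
SUFFIXES = ["ben", "en", "n"]


def candidate_lemmas(word: str, suffixes: list[str] = SUFFIXES) -> list[str]:
    """Return the word itself plus every form with an apparent suffix cut off."""
    candidates = [word]
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            candidates.append(word[: -len(suffix)])
    return candidates


def pick_lemma(word: str, hit_count: Callable[[str], int]) -> str:
    """Choose the candidate with the highest frequency estimate."""
    return max(candidate_lemmas(word), key=hit_count)


# HYPOTHETICAL frequencies standing in for search-engine hit counts.
FAKE_COUNTS = {"Citroenben": 900, "Citroen": 50000, "Citroenb": 3, "Citroenbe": 40}

print(candidate_lemmas("Citroenben"))
print(pick_lemma("Citroenben", lambda c: FAKE_COUNTS.get(c, 0)))
```

Passing the frequency source in as a function keeps the candidate logic testable offline; any corpus count or search API could be plugged in behind the same signature.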
54 NLP applications
- NER
- NEs with suffixes
- Information extraction
- Modality, uncertainty
- Causation
- Machine translation
- Morphemes vs. structures
55 Named Entities
- NEs should be recognized
- They should be morphosyntactically tagged -> proper syntactic/semantic analysis
- A Citroenben a Peugeot meghatározó tulajdonhányadot szerez.
- Mini dictionary, suffix list, semantic frame
56 (mini dictionary and suffix list)
a | DET | the
ben | S | in
Citroenben | ?
en | S | on
meghatározó | ADJ | dominant
n | S | on
ot | S | ACC
Peugeot | ?
szerez | V | acquire
t | S | ACC
tulajdonrész | N | interest
57 Possible analyses
- Citroenben
- Citroenben
- Citroen + ben: Citroen-INE
- Citroenb + en: Citroenb-SUP
- Citroenbe + n: Citroenbe-SUP
- Peugeot
- Peugeot
- Peugeo + t: Peugeo-ACC
- Peuge + ot: Peuge-ACC
58 A semantic frame
- <event frame=transaction.ownerchange> 1: V("szerez" | "vásárol" | "vesz" | "megvesz" | "megvásárol" | "felvásárol") subject=2 direct_object=3
- <rv role=buyer> 2: N </rv>
- 3: N("részesedés" | "tulajdon" | "tulajdonrész" | "rész" | "tulajdonhányad") compl1=4 modified_by_adj=5
- <rv role=product> 4: N case=ine ceg </rv>
- <rv role=newshare> 5: A measure modified_by_number=6
- 6: NB </rv>
- </event>
59 Analysis
- A Citroenben a Peugeot meghatározó tulajdonhányadot szerez.
- Tulajdonhányadot -> ACC/OBJ (3)
- Citroenben -> INE (4)
- Peugeot -> NOM/SUBJ (2)
- Peugeot acquires a dominant interest in Citroen.
60 Uncertainty
- Text Mining
- derive facts from free text
- uncertainty and negation have an impact on the quality/nature of the information extracted
- applications have to treat sentences / clauses containing uncertain or negated information differently from factual information
- Uncertainty: possible existence of a thing (neither its existence nor its non-existence is claimed)
61 Uncertainty detection
- Uncertainty detection in English: cues (words with uncertain content)
- One typical means to express uncertainty in Hungarian: -hat/-het
- High school grades may influence health.
- A középiskolai jegyek kihathatnak az egészségre.
- Morphological analysis should reflect modality (Voip3s---n)
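If the morphological analysis encodes modality in the tag, a cue detector for Hungarian can be a simple check on the code. The sketch below assumes, based on the verb examples in this deck, that the letter "o" at position 1 of a verb code marks the -hat/-het modal type; the function name `modal_cues` and the token format are hypothetical.

```python
def modal_cues(tagged_tokens: list[tuple[str, str]]) -> list[str]:
    """Return the word forms whose MSD code marks a modal (-hat/-het) verb.

    ASSUMPTION: modal verbs carry "Vo" at the start of their code,
    as in Voip2s---n for futhatsz.
    """
    return [word for word, msd in tagged_tokens if msd.startswith("Vo")]


tokens = [("futok", "Vmip1s---n"), ("futhatsz", "Voip2s---n")]
print(modal_cues(tokens))
```

A sentence containing at least one such cue could then be routed to the uncertain branch of an information extraction pipeline instead of being treated as a plain fact.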
62 Causation
- Semantic/thematic relations to be determined properly
- AGENT ≠ SUBJECT
- Varrattam egy ruhát.
- sew-CAUS-PAST-1Sg a dress-ACC
- I had a dress sewn.
- Varrattam Marival egy ruhát.
- sew-CAUS-PAST-1Sg Mari-INS a dress-ACC
- I had Mary sew a dress.
- Varrtam Marival egy ruhát.
- sew-PAST-1Sg Mari-INS a dress-ACC
- I sewed a dress with Mary.
- Causative information should be encoded (Vsip3s---n)
63 Argument structure of causative verbs
Sentence | Agent | Beneficiary | Patient
Varrattam egy ruhát. | ? | I (NOM) | ruha (ACC)
Varrattam Marival egy ruhát. | Mari (INS) | I (NOM) | ruha (ACC)
Varrtam Marival egy ruhát. | I (NOM), Mari (INS) | ? | ruha (ACC)
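The table amounts to a mapping from case-marked arguments to thematic roles that depends on whether the verb is causative. A minimal sketch under that reading follows; the function `assign_roles` and its argument names are hypothetical, and unknown roles (the "?" cells) are represented as an empty list or `None`.

```python
from typing import Optional


def assign_roles(causative: bool, nom: Optional[str] = None,
                 ins: Optional[str] = None, acc: Optional[str] = None) -> dict:
    """Map NOM/INS/ACC arguments to agent, beneficiary and patient."""
    if causative:
        # The nominative subject only has the action done; the agent,
        # if expressed at all, is the instrumental argument.
        agents = [ins] if ins else []
        beneficiary = nom
    else:
        # With a plain verb the subject acts, possibly together with
        # the instrumental argument.
        agents = [a for a in (nom, ins) if a]
        beneficiary = None
    return {"agent": agents, "beneficiary": beneficiary, "patient": acc}


print(assign_roles(True, nom="I", acc="ruha"))              # Varrattam egy ruhát.
print(assign_roles(True, nom="I", ins="Mari", acc="ruha"))  # Varrattam Marival egy ruhát.
print(assign_roles(False, nom="I", ins="Mari", acc="ruha")) # Varrtam Marival egy ruhát.
```

The `causative` flag is exactly the piece of information that has to come from the morphological code, which is why the tag needs to encode it.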
64 Machine translation
- Morpheme-based translation would be ideal
- Easier alignment of translational units
- Good morphological parser needed
- Easier to execute in dependency grammar
- Morpheme-based dependency structures
65 Alignments
ház | am | ban
house | my | in
66 Problems
- Not practical: no corpus available at the moment
- Portmanteau morphs: alignment problems
- Zero morphs: how many of them?
- 3 zero morphs in Hungarian nouns
- könyv-Ø-Ø-Ø vs. könyveit
- book-Ø-Ø-Ø vs. book-POSS-POSS.PL-ACC
- (Mel'čuk 2006)
67 (continued)
- Morphosyntactic codes might help
- Csinálhattátok: Vois2p---y
- Reordering rules
Code | Morpheme | English
V | csinál | do
o | hat | can
i | - | -
s | t | PAST
2p | tok | you
y | á | it
csinálhattátok: you could do it
68 An example
  could
  /   \
you    do
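The reordering idea can be sketched for the csinálhattátok example: each position of the code contributes an English word, and a fixed rule puts them in the order subject, auxiliary, verb, object. The lookup tables below contain only the entries needed for this one example and are assumptions for illustration; the function name `translate_verb` is hypothetical.

```python
# ASSUMPTION: minimal lookup tables covering only the Vois2p---y example.
TYPE_TO_AUX = {"o": "can"}                 # modal type -> English auxiliary
PERSON_NUMBER_TO_PRONOUN = {"2p": "you"}   # person + number -> subject pronoun
PAST_FORMS = {"can": "could"}              # past tense forms of auxiliaries


def translate_verb(lemma_en: str, msd: str) -> str:
    """Reorder the pieces of a positional verb code into English word order."""
    words = [PERSON_NUMBER_TO_PRONOUN[msd[4:6]]]  # subject first
    aux = TYPE_TO_AUX.get(msd[1])
    if aux:
        past = msd[3] == "s"
        words.append(PAST_FORMS[aux] if past else aux)
    # Tense inflection of the main verb itself is outside this sketch.
    words.append(lemma_en)
    if msd[9] == "y":                             # definite object -> "it"
        words.append("it")
    return " ".join(words)


print(translate_verb("do", "Vois2p---y"))
```

The Hungarian morpheme order is stem, modal, tense, person/number, while the English output starts with the pronoun, so even this one-word example needs a reordering rule.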
69 Syntax vs. case suffix
Pseudo-subject | Extra rules (PRED, OBJ); difficult for humans
Pseudo-object | List of adverbs with accusative ending
Pseudo-dative | List of verbs with dative subject
Unknown words (lemmas + suffixes) | Guessing (rules)
Information extraction
Thematic/semantic relations | Proper morphosyntactic codes, rules
Uncertainty detection | Proper morphosyntactic codes
Machine translation (morpheme-based) | Proper morphosyntactic codes
70 Summary
- Syntax-morphology interface in Hungarian
- Morphological coding systems
- Syntactic annotation in Hungarian corpora
- Morphosyntactic problems
- NER
- IE
- MT
71 References
- É. Kiss K., Kiefer F., Siptár P. 1999. Új magyar nyelvtan. Osiris Kiadó, Budapest.
- Farkas Richárd, Szeredi Dániel, Varga Dániel, Vincze Veronika 2010. MSD-KR harmonizáció a Szeged Treebank 2.5-ben. In: Tanács Attila, Vincze Veronika (eds.) VII. Magyar Számítógépes Nyelvészeti Konferencia. Szeged, Szegedi Tudományegyetem, pp. 349-353.
- Farkas, Richárd; Vincze, Veronika; Nagy, István; Ormándi, Róbert; Szarvas, György; Almási, Attila 2008. Web-based lemmatisation of Named Entities. In: Horák, Aleš; Kopeček, Ivan; Pala, Karel; Sojka, Petr (eds.) Proceedings of the 11th International Conference on Text, Speech and Dialogue (TSD2008). Berlin, Heidelberg, Springer Verlag, LNCS 5246, pp. 53-60.
- Koutny I., Wacha B. 1991. Magyar nyelvtan függőségi alapon. Magyar Nyelv 87(4), pp. 393-404.
- Mel'čuk, Igor 2006. Aspects of the Theory of Morphology. Mouton de Gruyter.
- Prószéky, G., Koutny, I., Wacha, B. 1989. Dependency Syntax of Hungarian. In: Maxwell, Dan; Schubert, Klaus (eds.) Metataxis in Practice (Dependency Syntax for Multilingual Machine Translation). Foris, Dordrecht, The Netherlands, pp. 151-181.
- Zsibrita János, Vincze Veronika, Farkas Richárd 2010. Ismeretlen kifejezések és a szófaji egyértelműsítés. In: Tanács Attila, Vincze Veronika (eds.) VII. Magyar Számítógépes Nyelvészeti Konferencia. Szeged, Szegedi Tudományegyetem, pp. 275-283.