Title: Machine Learning Approach to Automatic Functor Assignment in Prague Dependency Treebank
1 Machine Learning Approach to Automatic Functor Assignment in Prague Dependency Treebank
- Sašo Džeroski
- Jožef Stefan Institute, Ljubljana
- Department of Knowledge Technologies
- Joint work with Zdeněk Žabokrtský, Petr Sgall
- Charles University, Prague
- Institute of Formal and Applied Linguistics
2 Outline
- Materials
- The Prague Dependency Treebank
- Analytical and Tectogrammatical Tree Structures
- Training and Testing Sets / Representation
- Methods
- Data flow
- Machine Learning-based
- Rule-based
- Dictionary-based
- Results
- Conclusions and further work
3 Prague Dependency Treebank (PDT)
- Long-term project aimed at a complex annotation of a part of the Czech National Corpus with a rich annotation scheme
- Institute of Formal and Applied Linguistics
- Established in 1990 at the Faculty of Mathematics and Physics, Charles University, Prague
- Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall
- http://ufal.mff.cuni.cz
4 Prague Dependency Treebank
- Inspiration
- The Penn Treebank (the most widely used syntactically annotated corpus of English)
- Motivation
- The treebank can be used for further linguistic research
- More accurate results can be obtained (on a number of tasks) when using annotated corpora than when using raw texts
- PDT reaches representations suitable as input for semantic interpretation, unlike most other annotations
5 Layered structure of PDT
- Morphological level
- Full morphological tagging (word forms, lemmas, morphological tags)
- Analytical level
- Surface syntax
- Syntactic annotation using dependency syntax (captures analytical functions such as subject, object, ...)
- Tectogrammatical level
- Level of linguistic meaning (tectogrammatical functions such as actor, patient, ...)
Processing pipeline: raw text -> morphologically tagged text -> analytic tree structures (ATS) -> tectogrammatical tree structures (TGTS)
6 The Analytical Level
- The dependency structure chosen to represent the syntactic relations within the sentence
- Output of the analytical level: analytical tree structure
- Oriented, acyclic graph with one entry node
- Every word form and punctuation mark is a node
- The nodes are annotated by attribute-value pairs
- New attribute: analytical function
- Determines the relation between the dependent node and its governing node
- Values: Sb, Obj, Adv, Atr, ...
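As an illustration of this data structure (the class and attribute names are my own, not taken from the PDT tooling), an analytical-tree node might be sketched in Python as:

```python
# Illustrative sketch of an analytical-tree node; the class and
# attribute names are hypothetical, not part of the PDT tools.
class AtsNode:
    def __init__(self, form, lemma, tag, afun, parent=None):
        self.form = form        # word form or punctuation mark
        self.lemma = lemma
        self.tag = tag          # full morphological tag
        self.afun = afun        # analytical function: Sb, Obj, Adv, Atr, ...
        self.parent = parent    # governing node (None only for the entry node)
        self.children = []
        if parent is not None:
            parent.children.append(self)

# "Petr videl Marii." -- Peter saw Mary (tags shortened for readability)
root = AtsNode("videl", "videt", "V...", "Pred")
subj = AtsNode("Petr", "Petr", "N...1", "Sb", parent=root)
obj = AtsNode("Marii", "Marie", "N...4", "Obj", parent=root)
```

Every node except the entry node has exactly one governing node, so the structure is a rooted tree, matching the "oriented, acyclic graph with one entry node" description.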
7 The Tectogrammatical Level
- Based on the framework of Functional Generative Description as developed by Petr Sgall
- In comparison to the ATSs, the tectogrammatical tree structures (TGTSs) have the following characteristics:
- Only autosemantic words have a node of their own; function words (conjunctions, prepositions) are attached as indices to the autosemantic words to which they belong
- Nodes are added in case of clearly specified deletions on the surface level
- Analytical functions are substituted by tectogrammatical functions (functors), such as Actor, Patient, Addressee, ...
8 Functors
- Tectogrammatical counterparts of analytical functions
- About 60 functors
- Arguments (or theta roles) and adjuncts
- Actants (Actor, Patient, Addressee, Origin, Effect)
- Free modifiers (LOC, RSTR, TWHEN, THL, ...)
- Provide more detailed information about the relation to the governing node than the analytical function
9 AN EXAMPLE ATS: Michalková upozornila, že zatím je zbytečné podávat na správu žádosti či žádat ji o podrobnější informace.
Literal translation: Michalkova pointed-out that meanwhile is superfluous to-submit to administration requests or to-ask it for more-detailed information.
10 AN EXAMPLE TGTS FOR THE SENTENCE: M. pointed out that for the time being it was superfluous to submit requests to the administration, or to ask it for more detailed information.
Literal translation: Michalkova pointed-out that meanwhile is superfluous to-submit to administration requests or to-ask it for more-detailed information.
11 AN EXAMPLE TGTS FOR THE SENTENCE: The valuable and fascinating cultural event documents that the long-term high-quality strategy of the Painted House exhibitions, established by L. K., attracts further activities in the domains of art and culture.
12 Some TG Functors
- ACMP (accompaniment): mothers with children
- ACT (actor): Peter read a letter.
- ADDR (addressee): Peter gave Mary a book.
- ADVS (adversative): He came there, but didn't stay long.
- AIM (aim): He came there to look for Jane.
- APP (appurtenance, i.e., possession in a broader sense): John's desk
- APPS (apposition): Charles the Fourth, (i.e.) the Emperor
- ATT (attitude): They were here willingly.
- BEN (benefactive): She made this for her children.
- CAUS (cause): She did so since they wanted it.
- COMPL (complement): They painted the wall blue.
- COND (condition): If they come here, we'll be glad.
- CONJ (conjunction): Jim and Jack
- CPR (comparison): taller than Jack
- CRIT (criterion): According to Jim, it was raining there.
13 Some more TG Functors
- ID (entity): the river Thames
- LOC (locative): in Italy
- MANN (manner): They did it quickly.
- MAT (material): a bottle of milk
- MEANS (means): He wrote it by hand.
- MOD (modality): He certainly has done it.
- PAR (parenthesis): He has, as we know, done it yesterday.
- PAT (patient): I saw him.
- PHR (phraseme): in no way, grammar school
- PREC (preceding, particle referring to context): therefore, however
- PRED (predicate): I saw him.
- REG (regard): with regard to George
- RHEM (rhematizer, focus-sensitive particle): only, even, also
- RSTR (restrictive adjunct): a rich family
- THL (temporal-how-long): We were there for three weeks.
- THO (temporal-how-often): We were there very often.
- TWHEN (temporal-when): We were there at noon.
14 Automatic Functor Assignment
- Motivation: Currently annotation is done by humans and consumes huge amounts of linguistic experts' time
- Overall goal: Given an ATS, generate a TGTS
- Specific task: Given a node in an ATS, assign a tectogrammatical functor
- Approach: Use sentences with existing manually derived ATSs and TGTSs to learn how to assign tectogrammatical functors
- More specifically, use machine learning to learn rules for assigning tectogrammatical functors
15 What context of a node to take into account for AFA purposes?
a) only node U
b) the whole tree
c) node U and its parent
d) node U and its siblings
16 The attributes
- Lexical attributes: lemmas of both G (governing) and D (dependent) nodes, and the lemma of a preposition / subordinating conjunction that binds the two nodes
- Morphological attributes: POS, subPOS, morphological voice, morphological case
- Analytical attributes: the analytical functions of G/D
- Topological attributes: number of children (directly depending nodes) of both nodes in the TGTS
- Ontological attributes: semantic position of the node lemma within the EuroWordNet Top Ontology
17 Take 1 (2000): The attributes and the class
Given:
- Governing node
- Word form
- Lemma
- Full morphological tag
- Part of speech (POS) (extracted from the above)
- Analytical function from ATS
- Dependent node
- Word form
- Lemma
- Full morphological tag
- POS and case (extracted from the above)
- Analytical function
- Conjunction or preposition between G and D node
Predict: Functor of the dependent node
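As a sketch of how such an attribute vector might be assembled (the field names are illustrative, and the positional-tag indices are my assumption about the tag layout, not taken from the slides):

```python
# Hypothetical sketch: build the Take-1 attribute vector for one
# governing/dependent node pair. Tag positions are an assumption
# about the positional-tag layout, made for this illustration only.
def extract_attributes(gov, dep, conj_prep="null"):
    return {
        "gov_form": gov["form"], "gov_lemma": gov["lemma"],
        "gov_tag": gov["tag"], "gov_pos": gov["tag"][0],
        "gov_afun": gov["afun"],
        "dep_form": dep["form"], "dep_lemma": dep["lemma"],
        "dep_tag": dep["tag"], "dep_pos": dep["tag"][0],
        "dep_case": dep["tag"][4],   # assumed case slot in the tag
        "dep_afun": dep["afun"],
        "conj_prep": conj_prep,      # "null" when no function word binds them
    }
```

Each training example is then such a dictionary paired with the functor of the corresponding TGTS node as the class label.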
18 Training examples (one attribute vector per line; underscores in the extraction stand for Czech diacritics)
- zastavme zastavit1 vmp1a pred okamz_ik k okamz_ik nis4a n4 na adv thl
- zastavme zastavit1 vmp1a pred ustanoveni_ ustanoveni_ nns2a n2 u adv loc
- normy norma nfs2a n atr nove_ novy_ afs21a a0 atr rstr
- normy norma nfs2a n atr pra_vni_ pra_vni_ afs21a a0 atr rstr
- ustanoveni_ ustanoveni_ nns2a n adv normy norma nfs2a n2 atr pat
19 Take 1 (2000): The methods used
- Machine learning: Induction of decision trees
- Hand-crafted rules
- Dictionaries of unambiguous assignments
20 Machine Learning - Decision Trees
- Decision trees learned using C4.5
- Only leaves with accuracy over 80% kept
- Semiautomatic transformation into Perl:

    if ($dep_afun eq "atr") {
        if ($conj_prep eq "o") { $functor = "pat"; }
        if ($conj_prep eq "v") { $functor = "loc"; }
        if ($conj_prep eq "z") { $functor = "dir1"; }
        if ($conj_prep eq "null") {
            if ($dep_case eq "0") {
                if ($dep_morph eq "a") { $functor = "rstr"; }
            }
        }
    }

C4.5 tree fragment:

    dep_afun = atr:
        conj_prep = aby: aim (4.0/2.2)
        conj_prep = bez: acmp (2.0/1.0)
        conj_prep = do: dir3 (11.0/3.6)
        conj_prep = o: pat (25.0/4.9)
        conj_prep = v: loc (35.0/6.0)
        conj_prep = z: dir1 (35.0/3.8)
        conj_prep = null:
            dep_case = 0: ...
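For illustration, the same tree fragment transcribed by hand into Python (the project's actual pipeline emitted Perl, as shown on the slide); `None` stands for leaves the pruning removed:

```python
# Hand transcription (Python, for illustration only) of the C4.5
# fragment above; returns None where the pruned tree makes no prediction.
def assign_functor(dep_afun, conj_prep, dep_case, dep_morph):
    if dep_afun == "atr":
        by_prep = {"aby": "aim", "bez": "acmp", "do": "dir3",
                   "o": "pat", "v": "loc", "z": "dir1"}
        if conj_prep in by_prep:
            return by_prep[conj_prep]
        if conj_prep == "null" and dep_case == "0" and dep_morph == "a":
            return "rstr"
    return None
```

Returning `None` for pruned leaves is what trades recall for precision: the tree abstains rather than guess below the 80% accuracy threshold.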
21 Hand-crafted rules
- Verbs_active: if the governing node is an active verb
- If the analytical function is subject, then ACT
- Object in dative, then ADDR
- Object in accusative, then PAT
- Similar rules for verbs_passive, adjectives, pronouns_poss, numerals, pnom, pred
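A minimal sketch of the verbs_active rule in Python; the case codes (3 = dative, 4 = accusative) follow the usual Czech positional-tag convention and are my assumption here:

```python
# Sketch of the verbs_active hand-crafted rule; applies when the
# governing node is an active verb. Case codes are an assumption
# (3 = dative, 4 = accusative) made for this illustration.
def verbs_active(dep_afun, dep_case):
    if dep_afun == "Sb":
        return "ACT"    # subject of an active verb -> Actor
    if dep_afun == "Obj" and dep_case == "3":
        return "ADDR"   # dative object -> Addressee
    if dep_afun == "Obj" and dep_case == "4":
        return "PAT"    # accusative object -> Patient
    return None
```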
22 Dictionaries generated from data
- Adverbs: adverb-functor pairs extracted from the training set; pairs with unambiguous adverbs saved in the dictionary
- Prep-noun: all preposition-noun pairs extracted; unambiguous pairs that occur at least twice saved in the dictionary
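Both dictionaries can be sketched with one helper (hypothetical code, not the original scripts): keep only keys that always map to the same functor in the training set, optionally requiring a minimum frequency (1 for adverbs, 2 for preposition-noun pairs):

```python
from collections import Counter, defaultdict

def build_dictionary(pairs, min_count=1):
    """pairs: iterable of (key, functor) observations from the training set.
    Returns {key: functor} for unambiguous keys seen >= min_count times."""
    pairs = list(pairs)
    counts = Counter(pairs)
    functors = defaultdict(set)
    for key, functor in pairs:
        functors[key].add(functor)
    result = {}
    for key, fs in functors.items():
        if len(fs) == 1:                      # unambiguous key
            functor = next(iter(fs))
            if counts[(key, functor)] >= min_count:
                result[key] = functor
    return result
```

For the adverb dictionary the key would be the adverb lemma; for the prep-noun dictionary a (preposition, noun) pair with `min_count=2`.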
23 AFA Evaluation (Take 1)
- Divide existing sentences into a training set (6049 nodes) and a testing set (1089 nodes) to be able to evaluate performance
- 1) Only ML
- a) without pruning: Cover 100%, Precision = Recall = 76%
- b) ML80 (after pruning of the rules with expected precision worse than 80%): Cover 37.3%, Recall 35.3%, Precision 94.5%
- 2) Only hand-crafted rules: Cover 51.2%, Recall 48.1%, Precision 93.9%
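The three figures relate as recall = cover x precision (e.g. 37.3% x 94.5% = 35.3%). A sketch of how they might be computed, with `None` marking nodes the system left unassigned:

```python
def evaluate(predicted, gold):
    """predicted[i]: assigned functor or None (unassigned);
    gold[i]: the manually annotated functor. A sketch, not the
    original evaluation script."""
    total = len(gold)
    covered = sum(1 for p in predicted if p is not None)
    correct = sum(1 for p, g in zip(predicted, gold)
                  if p is not None and p == g)
    cover = covered / total
    precision = correct / covered if covered else 0.0
    recall = correct / total          # equals cover * precision
    return cover, precision, recall
```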
24 AFA Evaluation (Take 1)
- 3) ML80 + hand-crafted rules + dictionaries (adverbs + prep-noun): Cover 62.8%, Recall 58.7%, Precision 93.5%
- When trying to assign everything, with the available training set it is probably not possible to reach AFA accuracy of 90% (rather 75 to 80%)
- ... but using a subset of the available methods, it is possible to reach sufficient precision at 60% cover
25 One automatically annotated TGTS (after Take 1)
- Proto je dobré seznámit se s jejich praktikami a tak vlastně preventivně předcházet možným metodám konkurenčních firem. (Translation: Therefore it is good to become familiar with their practices and thus, in effect, preventively forestall possible methods of competing firms.)
26 Take 2 (2002)
- Lesson from Take 1: Annotators want high recall, even at the cost of lower precision
- Consequence: Use machine learning only
- More training data / annotated sentences (1536 sentences, 27463 nodes in total)
- Use a larger set of attributes
- Topological (number of children of G/D nodes)
- Ontological (WordNet)
- Newer version of ML software (C5.0)
27 Ontological attributes
- Semantic concepts (63) of the Top Ontology in EuroWordNet (EWN) (e.g., Place, Time, Human, Group, Living, ...)
- For each English synset, a subset of these is linked
- Inter Lingual Index: Czech lemma -> English synset -> subset of semantic concepts
- 63 binary attributes: positive/negative relation of the Czech lemma to the respective concept of the EWN Top Ontology
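A hedged sketch of deriving the binary concept attributes for a Czech lemma; the two dictionaries below stand in for the Inter Lingual Index and the synset-to-concept links, and only 5 of the 63 concepts are shown:

```python
# Hypothetical stand-ins for the EWN resources; only 5 of the
# 63 Top Ontology concepts are listed for this illustration.
CONCEPTS = ["Place", "Time", "Human", "Group", "Living"]

def top_ontology_attributes(czech_lemma, ili, synset_concepts):
    """ili: Czech lemma -> English synsets (via the Inter Lingual Index);
    synset_concepts: English synset -> linked Top Ontology concepts.
    Returns one binary attribute per concept."""
    linked = set()
    for synset in ili.get(czech_lemma, ()):
        linked.update(synset_concepts.get(synset, ()))
    return {concept: concept in linked for concept in CONCEPTS}
```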
28 Methodology
29 Methodology
- Evaluation of accuracy by 10-fold cross-validation
- Rules to illustrate the learned concepts
- Trees translated to Perl code and included in TrEd, a tool that the annotators use
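The 10-fold cross-validation step can be sketched as follows (a generic illustration, not the C5.0 built-in):

```python
def k_fold_splits(items, k=10):
    """Partition items into k folds; yield one (train, test) pair per
    fold, so every item is tested exactly once. A generic sketch."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Averaging accuracy over the k held-out folds gives an estimate that uses all annotated nodes for testing without ever testing on training data.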
30Different sets of attributes
- E-0 (empty)
- E1 Only POS E2 Only Analytical function
- E3 All morphological atts E-2
- E4 E3 Attributes of governing node
- E5 E4 funct. Words (preps./conjs.)
- E6 E5 lemmas E7 E5 EWN
- E8 E6 E7
31 AFA performance
32 Example rules (1)
33 Example rules (2)
34 Example rules (3)
35 Example rules (4)
36 Example rules (5)
37 Example rules (6)
38 Example rules ()
39 Example rules (E8)
40 Learning curve (for E8)
41 Using the learned AFA trees
- PDT annotators use the TrEd editor
- Learned trees transformed into Perl
- A keyboard shortcut defined in TrEd executes the decision tree for each node of the TGTS and assigns functors
- Color coding of functors based on confidence:
- Black: over 90%
- Red: less than 60%
- Blue: otherwise
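The color scheme reduces to a simple threshold function (a sketch; confidence expressed as a fraction):

```python
def functor_color(confidence):
    """Flag an AFA assignment for the annotator by its confidence
    (sketch of the color coding described above)."""
    if confidence > 0.90:
        return "black"   # trusted assignment
    if confidence < 0.60:
        return "red"     # likely needs manual correction
    return "blue"        # intermediate confidence
```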
42 Using the learned AFA trees in TrEd
43 Annotators' response
- Six annotators
- All agree: the use of AFA significantly increases the speed of annotation (annotation takes twice as long without it)
- All annotators prefer to have as many assigned functors as possible
- They do not use the colors (even though red nodes are corrected in 75% of cases on unseen data)
- Found some systematic errors made by AFA; suggested the use of topological attributes
44 Conclusions
- ML very helpful for annotating PDT, even though the tectogrammatical level of PDT is very close to the semantics of natural language
- Faster
- Very accurate
- Automatically assigned functors corrected in 20% of the cases
- Human annotators disagree in more than 10% of the cases
- Very close to what is possible to achieve through learning
45 Further work
- Slovene Dependency Treebank
- Morphological analysis (done)
- Part-of-speech tagging (done)
- Parsing / grammar (only a rough draft)
- Annotation of sentences from Orwell's 1984 (in progress)