Title: Olga Pustylnikov, Alexander Mehler
1 A Unified Database of Dependency Treebanks
Integrating, Quantifying & Evaluating Dependency Data
- Olga Pustylnikov, Alexander Mehler
- Bielefeld University
2 Motivation
- Exploring similarities among languages by means of syntactic treebanks
- We collected a database covering 11 languages
- Treebanks have been developed separately by different research projects
- Quantitative investigations on these treebanks -> the need for unification
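3 Motivation
[Figure: one corpus sentence, "John loves Mary", its dependency structure (head "loves" with dependents "John" and "Mary"), and three annotations of it]
corpus: John loves Mary
structure: loves -> John, loves -> Mary
annotation (bracketed):
  (loves v ((John n) (Mary n)))
annotation (tabular: ID, word form, POS, head ID):
  1 John n 2
  2 loves v 0
  3 Mary n 2
annotation (XML):
  <S ID="1">
    <W DOM="2" ID="1"> John </W>
    <W DOM="_root" ID="2"> loves </W>
    <W DOM="2" ID="3"> Mary </W>
  </S>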
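The three annotations on this slide encode the same tree, so one can be derived from another. The following is a minimal sketch, not the authors' converter: it assumes the tabular columns are ID, word form, POS and head ID (with 0 marking the root), and the function names `to_bracketed` and `to_xml` are hypothetical.

```python
# Rows of the tabular annotation: (id, form, pos, head); head 0 marks the root.
ROWS = [(1, "John", "n", 2), (2, "loves", "v", 0), (3, "Mary", "n", 2)]


def to_bracketed(rows):
    """Render the dependency tree as nested brackets, head first."""
    tokens = {}
    children = {}
    for tok_id, form, pos, head in rows:
        tokens[tok_id] = (form, pos)
        children.setdefault(head, []).append(tok_id)

    def render(tok_id):
        form, pos = tokens[tok_id]
        deps = children.get(tok_id, [])
        if not deps:
            return f"({form} {pos})"
        inner = " ".join(render(dep) for dep in deps)
        return f"({form} {pos} ({inner}))"

    (root,) = children[0]  # assumes exactly one root per sentence
    return render(root)


def to_xml(rows, sentence_id="1"):
    """Render the same rows in the S/W style, where DOM names the head."""
    words = []
    for tok_id, form, _pos, head in rows:
        dom = "_root" if head == 0 else str(head)
        words.append(f'<W DOM="{dom}" ID="{tok_id}"> {form} </W>')
    return f'<S ID="{sentence_id}"> ' + " ".join(words) + " </S>"


print(to_bracketed(ROWS))
print(to_xml(ROWS))
```

Note that the S/W rendering drops the POS column, as in the slide's XML example; a lossless conversion would need an extra attribute.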
4 Motivation
Demands on the unified format of treebanks
- generic: able to represent as many treebanks as possible
- extensible: to new treebanks
- complete: preserving all corpus-specific information
- transferable: to other kinds of corpora
- complex: exhibiting the minimal complexity
-> graph representations
5 Motivation
GXL (Holt et al., 2006)
- The Graph eXchange Language (GXL) is an XML-based graph model that represents corpora as graphs
[Figure: diagram with the labels XML, GXL, eGXL, Treebanks, Multimodal Data, WIKI, TOOLS]
- GXL can be applied to any kind of corpus (see e.g. Mehler and Gleim (2005), Ferrer i Cancho et al. (2007), Pustylnikov and Mehler (2008))
6 Agenda
7 eGXL
2-level data model
Types:
  <graph id="Types">
    <node id="POS" />
    <node id="t245" name="VERB" />
  </graph>
IDREF (the pos attribute of a Sentences node refers to the id of a Types node)
Sentences:
  <graph id="Sentences">
    <graph id="g8">
      <node id="s8_1" form="Detta" pos="t151" />
      <node id="s8_2" form="vill" pos="t245" />
      ...
      <rel>
        <relend direction="in" target="s8_2" />
        <relend direction="out" target="s8_1" />
      </rel>
      ...
    </graph>
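The two-level model can be illustrated by building such a document programmatically. This is a sketch under assumptions: the element and attribute names (graph, node, rel, relend, form, pos, direction, target) come from the slide, while the wrapper element `egxl`, the function `build_document` and the type name "PRON" for `t151` are hypothetical and only serve the example.

```python
import xml.etree.ElementTree as ET


def build_document(tokens, relations, pos_ids, sentence_id="g8"):
    """Build a Types graph and a Sentences graph linked by IDREFs."""
    root = ET.Element("egxl")  # hypothetical wrapper, not taken from the slides

    # Level 1: the Types graph holds one node per attribute value.
    types = ET.SubElement(root, "graph", id="Types")
    ET.SubElement(types, "node", id="POS")
    for name, type_id in pos_ids.items():
        ET.SubElement(types, "node", id=type_id, name=name)

    # Level 2: the Sentences graph holds tokens and relations.
    sentences = ET.SubElement(root, "graph", id="Sentences")
    sentence = ET.SubElement(sentences, "graph", id=sentence_id)
    for tok_id, form, pos_name in tokens:
        # pos stores only the identifier of the Types node (the IDREF).
        ET.SubElement(sentence, "node", id=tok_id, form=form, pos=pos_ids[pos_name])
    for head, dependent in relations:
        rel = ET.SubElement(sentence, "rel")
        ET.SubElement(rel, "relend", direction="in", target=head)
        ET.SubElement(rel, "relend", direction="out", target=dependent)
    return root


# "PRON" is an assumed name for t151; the slide only names t245 (VERB).
pos_ids = {"PRON": "t151", "VERB": "t245"}
tokens = [("s8_1", "Detta", "PRON"), ("s8_2", "vill", "VERB")]
relations = [("s8_2", "s8_1")]  # (head, dependent)

document = build_document(tokens, relations, pos_ids)
print(ET.tostring(document, encoding="unicode"))
```

Keeping only identifiers in the Sentences graph means that a treebank with a different tag set changes the Types graph alone.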
9 The eGXL Types-graph
- The Types-graph contains treebank-specific attributes (e.g. POS, morphological attributes etc.) -> nodes
- Each instance of an attribute is given a unique identifier
  <graph id="Types">
    <node id="POS" />
    <node id="t245" name="VERB" />
  </graph>
id: a unique identifier
name: the value of the attribute
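The idea that each attribute value receives its own identifier can be sketched as a small registry. This is an illustration only: the class `TypeRegistry` and its sequential numbering scheme are assumptions, and the identifiers it produces will not match those of the actual database (such as t245).

```python
class TypeRegistry:
    """Assigns one stable identifier to each (attribute, value) pair."""

    def __init__(self, prefix="t"):
        self._prefix = prefix
        self._ids = {}

    def id_for(self, attribute, value):
        """Return the identifier of a value, creating it on first use."""
        key = (attribute, value)
        if key not in self._ids:
            # Hypothetical scheme: number values in order of first appearance.
            self._ids[key] = f"{self._prefix}{len(self._ids) + 1}"
        return self._ids[key]

    def nodes(self):
        """List (identifier, attribute, value) triples for the Types graph."""
        return [(type_id, attr, value) for (attr, value), type_id in self._ids.items()]


registry = TypeRegistry()
verb_id = registry.id_for("POS", "VERB")
noun_id = registry.id_for("POS", "NOUN")
same_verb_id = registry.id_for("POS", "VERB")  # reuses the existing identifier
print(registry.nodes())
```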
10 The eGXL Sentences-graph
[Figure: dependency graph over the tokens vill, ., Detta, bestämt, jag, bemöta]
  <graph id="Sentences">
    <graph id="g8">
      <node id="s8_1" form="Detta" pos="t151" />
      <node id="s8_2" form="vill" pos="t245" />
      ...
      <rel>
        <relend direction="in" target="s8_2" />
        <relend direction="out" target="s8_1" />
      </rel>
      ...
    </graph>
Callouts:
- each token of a treebank
- an IDREF to the POS-node of the Types-graph
- word form
- a (syntactic) relation
- from (e.g. a head verb)
- to (e.g. a dependent argument)
11 The eGXL Sentences-graph
[Figure: dependency graph over the tokens vill, ., Detta, bestämt, jag, bemöta]
node: each token of a treebank
id: a unique identifier
form: word form
pos: an IDREF to the POS-node of the Types-graph
rel: a (syntactic) relation
relend: a relation anchor
in: from (e.g. a head verb)
out: to (e.g. a dependent argument)
  <graph id="Sentences">
    <graph id="g8">
      <node id="s8_1" form="Detta" pos="t151" />
      <node id="s8_2" form="vill" pos="t245" />
      ...
      <rel>
        <relend direction="in" target="s8_2" />
        <relend direction="out" target="s8_1" />
      </rel>
      ...
    </graph>
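Reading the format back shows how the elements listed on this slide fit together: each rel yields one head-dependent pair, and each pos value is resolved against the Types graph. The sketch below assumes a complete, well-formed document; the wrapper element `doc` and the function `read_dependencies` are hypothetical, and the fragment omits the Types node for t151 because the slides do not show it.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<doc>
  <graph id="Types">
    <node id="POS" />
    <node id="t245" name="VERB" />
  </graph>
  <graph id="Sentences">
    <graph id="g8">
      <node id="s8_1" form="Detta" pos="t151" />
      <node id="s8_2" form="vill" pos="t245" />
      <rel>
        <relend direction="in" target="s8_2" />
        <relend direction="out" target="s8_1" />
      </rel>
    </graph>
  </graph>
</doc>
"""


def read_dependencies(xml_text):
    """Return (head form, head POS, dependent form, dependent POS) tuples."""
    root = ET.fromstring(xml_text)
    # Map type identifiers to their values, e.g. "t245" -> "VERB".
    pos_names = {n.get("id"): n.get("name")
                 for n in root.findall("graph[@id='Types']/node")}
    result = []
    for sentence in root.findall("graph[@id='Sentences']/graph"):
        tokens = {n.get("id"): n for n in sentence.findall("node")}
        for rel in sentence.findall("rel"):
            # "in" anchors the head, "out" anchors the dependent.
            ends = {e.get("direction"): e.get("target") for e in rel.findall("relend")}
            head, dep = tokens[ends["in"]], tokens[ends["out"]]
            result.append((head.get("form"), pos_names.get(head.get("pos")),
                           dep.get("form"), pos_names.get(dep.get("pos"))))
    return result


# The POS of "Detta" comes back as None because t151 is not defined here.
print(read_dependencies(FRAGMENT))
```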
12 eGXL
13 Agenda
14 11 Dependency Treebanks
7 different formats
15 Input vs. Output Formats
- Examples from Dutch, Swedish, Italian treebanks
16 Unification is possible
- due to the separation of the core from the secondary parts
diversity (Types-graph):
  <graph id="Types">
    <node id="POS" />
    <node id="t245" name="VERB" />
  </graph>
commonality (Sentences-graph):
  <graph id="Sentences">
    <graph id="g8">
      <node id="s8_1" form="Detta" pos="t151" />
      <node id="s8_2" form="vill" pos="t245" />
      ...
      <rel>
        <relend direction="in" target="s8_2" />
        <relend direction="out" target="s8_1" />
      </rel>
      ...
    </graph>
17 The TreebankWiki
- http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/
18 Agenda
19 Complexity of eGXL
- Logical Scaling Factor (LSF): the number of logical elements (e.g. XML elements) required to represent a treebank unit (e.g. a word form, POS etc.)
[Figure: chart with the labels node, rel, other, eGXL]
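One plausible reading of the LSF is a ratio: the total number of XML elements in a document divided by the number of treebank units it encodes. The sketch below follows that reading and is not the authors' measurement procedure; the function `logical_scaling_factor` is hypothetical, and it assumes that tokens (node elements) are the unit being counted.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<graph id="g8">'
    '<node id="s8_1" form="Detta" pos="t151" />'
    '<node id="s8_2" form="vill" pos="t245" />'
    "<rel>"
    '<relend direction="in" target="s8_2" />'
    '<relend direction="out" target="s8_1" />'
    "</rel>"
    "</graph>"
)


def logical_scaling_factor(xml_text, unit_tag="node"):
    """Return XML elements per treebank unit (assumed definition)."""
    root = ET.fromstring(xml_text)
    elements = sum(1 for _ in root.iter())       # every element, root included
    units = sum(1 for _ in root.iter(unit_tag))  # elements that encode a unit
    if units == 0:
        raise ValueError(f"no <{unit_tag}> elements found")
    return elements / units


# 6 elements (graph, 2 node, rel, 2 relend) for 2 tokens
print(logical_scaling_factor(SAMPLE))  # 3.0
```

Under this reading, a lower value means fewer markup elements per token, which allows formats to be compared on the same treebank.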
20 Agenda
21 DTDB
22 Agenda
23 Conclusions
- a database covering 11 languages
- eGXL: a generic XML graph model adapted to syntactic treebanks
- use of treebanks within a single application (Ariadne)
- olga.pustylnikov_at_uni-bielefeld.de
- alexander.mehler_at_uni-bielefeld.de
- ruediger.gleim_at_uni-bielefeld.de
- SFB 673
- Thank you for your attention!