Encoding Croatian Corpora - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Encoding Croatian Corpora


1
Encoding Croatian Corpora
  • Marko Tadić (marko.tadic_at_ffzg.hr,
    www.hnk.ffzg.hr/mt)
  • Department of Linguistics / Institute of
    Linguistics, Faculty of Philosophy, University of
    Zagreb (www.ffzg.hr, www.ffzg.hr/zzl/zzl-home.htm)
  • Tübingen, 2001-02-22

2
Lecture plan
  • Monolingual corpora
  • Croatian National Corpus (HNK)
  • Bilingual corpora
  • Croatian-English parallel corpus
  • Croatian-Slovenian parallel corpus
  • Acquis translations parallel corpus

3
Croatian National Corpus (HNK) 1
  • project no. 130718 of the Ministry of Science and
    Technology of the Republic of Croatia, "Computational
    processing of Croatian language"; formally started
    in 1996, actually in 1998
  • theoretical foundations (www.hnk.ffzg.hr/cilj) laid
    out in 1995, published as
  • Tadić (1996) Računalna obradba hrvatskoga i
    nacionalni korpus, Suvremena lingvistika 41-42,
    603-612
  • Tadić (1998) Raspon, opseg i sastav korpusa
    suvremenoga hrvatskoga jezika, Filologija 30-31,
    337-347
  • need for a reference corpus of Croatian
  • 1st step: written
  • later: some 10% spoken
  • a tentative solution for its composition
  • the size, time-span and structure were elaborated
  • accessibility via a WWW service was suggested

4
HNK 2: structure
  • 30m: 30-million Corpus of Contemporary Croatian
  • texts from 1990 until today
  • different domains and genres
  • representativeness for the contemporary Croatian
    standard
  • HETA: Croatian Electronic Text Archive (Hrvatski
    Elektronski Tekstovni Arhiv)
  • whole texts older than 1990
  • whole texts of complete publications after 1990
    which would unbalance the representativeness of
    30m

5
HNK 3: 30m text typology
  • 1. Informative texts/Faction: 76%, 22,800,000
  • 1.1. newspapers: 37%, 11,100,000
  • 1.1.1. daily: 22%, 6,600,000
  • 1.1.2. weekly: 9%, 2,700,000
  • 1.1.3. bi-weekly: 3%, 900,000
  • 1.1.4. irregular: 3%, 900,000
  • 1.2. magazines: 17%, 5,100,000
  • 1.2.1. weekly: 10%, 3,000,000
  • 1.2.2. bi-weekly: 1%, 300,000
  • 1.2.3. monthly: 3%, 900,000
  • 1.2.4. bi/tri-monthly: 3%, 900,000
  • 1.3. books: 22%, 6,600,000
  • 1.3.1. journalism: 7%, 2,100,000
  • 1.3.2. crafts etc.: 2%, 600,000
  • 1.3.3. science: 13%, 3,900,000
  • 2. Imaginative texts/Fiction: 21%, 6,300,000
  • 2.1. prose: 21%, 6,300,000
  • 2.1.1. novels: 13%, 3,900,000
  • 2.1.2. stories: 7%, 2,100,000
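
The quotas in the typology above follow from applying each percentage to the 30-million target. The sketch below recomputes them for the informative branch only; the dictionary layout and the function name `quota` are illustrative assumptions, not part of the HNK tooling. The fiction branch is left out because its listed subcategories add up to 20% rather than the 21% shown for prose.

```python
# Target size of the 30m corpus, as stated on the slide.
CORPUS_SIZE = 30_000_000

# Leaf categories of "1. Informative texts/Faction" with their shares in percent.
INFORMATIVE_SHARES = {
    "1.1.1 newspapers, daily": 22,
    "1.1.2 newspapers, weekly": 9,
    "1.1.3 newspapers, bi-weekly": 3,
    "1.1.4 newspapers, irregular": 3,
    "1.2.1 magazines, weekly": 10,
    "1.2.2 magazines, bi-weekly": 1,
    "1.2.3 magazines, monthly": 3,
    "1.2.4 magazines, bi/tri-monthly": 3,
    "1.3.1 books, journalism": 7,
    "1.3.2 books, crafts etc.": 2,
    "1.3.3 books, science": 13,
}


def quota(percent, total=CORPUS_SIZE):
    """Return the size quota for a category given its share in percent."""
    return total * percent // 100


if __name__ == "__main__":
    for name, share in INFORMATIVE_SHARES.items():
        print(f"{name:35s} {share:3d}% {quota(share):>12,}")
    total_share = sum(INFORMATIVE_SHARES.values())
    print(f"{'1. Informative texts':35s} {total_share:3d}% {quota(total_share):>12,}")
```

The leaf shares sum to the 76% given for the whole branch, which is a quick consistency check on the table.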

6
HNK 4: corpus on WWW (http://www.hnk.ffzg.hr)
  • Testing V 1.0: 1998-12-05
  • 30m: 3 Mw
  • Testing V 1.1: 1999-02-14, 1999-07-20
  • 30m: 7.67 Mw
  • HETA: 2.9 Mw from the CD-ROM Classics of Croatian
    Literature, Naklada Bulaja, Zagreb, 1999
  • Testing V 1.1 (approx. 10 Mw) of the corpus is
    accessible via WWW
  • text format: quasi-HTML, no XML
  • no POS marking
  • Testing V 1.2 (approx. 17 Mw)
  • being filled right now
  • no additional retrieval facilities

7
HNK 5: Statistics
  • www.hnk.ffzg.hr/stats

8
HNK 6: text conversion and encoding
  • XML
  • XCES (XML version of CES)
  • Ide, Bonhomme & Romary (2000)
  • DIVs, Ps, Ws
  • S-boundary detection algorithm
  • problem with ordinal numbers written with
    punctuation
  • input text formats
  • WWW: HTML, XML
  • DTP: RTF, DOC, QXD, WP, TXT etc.
  • conversion
  • 2XML: custom-made software
  • input: HTML, RTF / output: XML, no header
  • two-step conversion by user-defined scripts
  • enables a high level of automation

9
HNK 7: corpus format 1
<?xml version="1.0"?>
<!DOCTYPE cesDoc PUBLIC "-//CES//DTD XML cesDoc//EN" "xcesDoc.dtd">
<cesDoc version="3.19">
  <cesHeader type="text" version="3.19">
    <fileDesc>
      <titleStmt>
        <h.title>Electronic version of Vecernji list, vl990311</h.title>
        <respStmt>
          <respType>XCES markup prepared by</respType>
          <respName>Bosko Bekavac</respName>
        </respStmt>
      </titleStmt>
      <extent>
        <wordCount>4456</wordCount>
        <byteCount>25385</byteCount>
      </extent>
      <publicationStmt>
        <distributor>Project MZT RH 130718</distributor>
        <pubAddress>Institute of linguistics</pubAddress>
        <telephone>385 1 6120-142</telephone>
        <fax>385 1 6856-118</fax>
        <eAddress>http://www.ffzg.hr/zzl/zzl-home.htm</eAddress>
        <idno>76676665676</idno>
        <availability status="free"></availability>
        <pubDate>1999-12-20</pubDate>
      </publicationStmt>
      <sourceDesc>
        <biblStruct>
          <monogr>
            <h.title>Vecernji list</h.title>
            <h.author></h.author>
            <imprint>
              <pubPlace>Zagreb</pubPlace>
              <publisher>Vecernji list</publisher>
              <pubDate>1999-03-11</pubDate>
            </imprint>
          </monogr>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <projectDesc>Croatian National Corpus is being collected in
        the Institute of linguistics, Faculty of Philosophy,
        University of Zagreb in the frame of the project "Computer
        processing of Croatian language" granted by the Ministry of
        Science and Technology of Republic of Croatia under
        No. 130718</projectDesc>
    </encodingDesc>
    <profileDesc>
      <langUsage>
        <language id="hr" iso639="hr">Croatian</language>
      </langUsage>
      <textClass>
        <catRef target="xxxxx"></catRef>
        <h.keywords>
          <keyTerm>Newspaper</keyTerm>
        </h.keywords>
      </textClass>
    </profileDesc>
  </cesHeader>

10
HNK 7: corpus format 2
<BODY>
<DIV0 type="article">
<HEAD type="nn">U GORICI SVETOJANSKOJ ODRŽAN 12. FESTIVAL PJEVAČA
AMATERA</HEAD>
<HEAD type="na">Ivana osvojila županijski Sanremo</HEAD>
<HEAD type="pn"> Od 20 natjecatelja žiri je najboljom proglasio
Ivanu Erdeljac s pjesmom "Crazy", druga je Antonija Mikita s
pjesmom "To", a treće je mjesto osvojila Ksenija Cvetetic</HEAD>
<FIGURE>Publici su se najviše svidjeli Marija alic i Petar
Puhijera</FIGURE>
<P>Pod medijskim pokroviteljstvom "Večernjeg lista" i Radio Jaske,
a uz pomoć DIR "Rubinic" kao generalnog te još sedamdesetak drugih
sponzora, u petak i u subotu u Gorici Svetojanskoj pokraj
Jastrebarskog održan je 12. festival pjevača amatera.</P>
<P>Prve festivalske večeri, na kojoj su nastupila 22 izvođača do
15 godina, prvu nagradu stručnog žirija odnijela je Petra Batelja
iz Rastoka pokraj Jaske za pjesmu "To malo ljubavi". Druga nagrada
pripala je Nikolini Oslakovic iz Gornje Reke za pjesmu "Neka mi ne
svane", a treća Mariji Jurini iz Desinca za pjesmu "Ginem".
Publika je najboljom ocijenila svetojansku grupu "Mrvice" s
pjesmom "Mrvica", dok je drugu nagradu dodijelila Natali Rajnovic
iz Jaske za pjesmu "Don't ever cry", a treću Aniti Oslakovic iz
Desinca za pjesmu "Malo fali". Za najboljeg debitanta prve večeri
proglašena je Irena Kian iz Zdenčine s pjesmom "Izdali me".</P>
<P>Druga večer - s dvadeset starijih izvođača iz Jaske, Karlovca,
Bjelovara, Zagreba i Velike Gorice - bila je osobito napeta, jer
je za razliku od lani ponudila vrlo kvalitetne izvođače i
interpretacije pa nije bilo lako odabrati najbolje.</P>
<P>Nakon poduže stanke tijekom koje su izbrojani glasovi - a koju
su publici kratili gost večeri Ivo Pattiera te sastav "Santa Anna"
i solistica Goga Copic - proglašeni su ovogodišnji pobjednici.
Prema ocjeni stručnog žirija, prvu nagradu i zlatnu plaketu
"Večernjaka" dobila je Karlovčanka Ivana Erdeljac za vrlo dobro
otpjevanu pjesmu "Crazy". Druga nagrada pripala je Antoniji Mikiti
iz Velike Gorice za pjesmu "To", a treća Kseniji Cvetetic iz
Petrovine za pjesmu "Neka mi ne svane".</P>
<P>Publika je najviše glasova dodijelila svetojansko-zagrebačkom
duetu Mariji alic i Petru Puhijeri za interpretaciju pjesme "Ima
li nade za nas", pa je i njima pripala "Večernjakova" zlatna
plaketa. Na drugo mjesto publika je svrstala "Svetojanske
tamburaše" koji su nastupili s pjesmom "Dobro jutro", a na treće
Zagrepčanku Marijanu Parilac i pjesmu "Idi i ne budi ljude".</P>
<P>Najboljom debitanticom završne večeri proglašena je Zagrepčanka
Marina Posilovic s pjesmom "Pii, pii mi", a nagradu za najbolji
scenski nastup dobio je sastav iz Petrovine "Prigorje de lajt" s
pjesmom "Oj suseda, suseda". Čini se da su ovogodišnje nagrade - a
bilo ih je doista mnogo, od sedmodnevnog boravka u Opatiji,
umjetničke slike, bicikla i kazetofona do satova i poklon-bonova -
završile u pravim rukama. Oni koji ih nisu dobili, a možda su ih
također zaslužili, neka se ovaj put utješe pljeskom publike, a
dogodine će imati novu priliku. Jer, tradicija Svetojanskog
festivala - svojevrsnog Sanrema zagrebačke županije - nastavlja
se.</P>
<BYLINE>N. Godrijan-Videc</BYLINE>
</DIV0>
</BODY>

11
HNK 8: corpus format 3
<BODY>                  vl990301gr01    1  X
<DIV0 type="article">   vl990301gr01    7  X
<HEAD type="nn">        vl990301gr01   28  X
U                       vl990301gr01   44  R
GORICI                  vl990301gr01   46  R
SVETOJANSKOJ            vl990301gr01   53  R
ODR&#381;AN             vl990301gr01   66  R
12                      vl990301gr01   78  B
.                       vl990301gr01   80  I
FESTIVAL                vl990301gr01   82  R
PJEVA&#268;A            vl990301gr01   91  R
AMATERA                 vl990301gr01  104  R
</HEAD>                 vl990301gr01  111  X
<HEAD type="na">        vl990301gr01  118  X
Ivana                   vl990301gr01  134  R
osvojila                vl990301gr01  140  R
&#382;upanijski         vl990301gr01  149  R
Sanremo                 vl990301gr01  165  R
</HEAD>                 vl990301gr01  172  X
<HEAD type="pn">        vl990301gr01  179  X
                        vl990301gr01  195  I
Od                      vl990301gr01  197  R
20                      vl990301gr01  200  B
natjecatelja            vl990301gr01  203  R
&#382;iri               vl990301gr01  216  R
je                      vl990301gr01  226  R
najboljom               vl990301gr01  229  R
proglasio               vl990301gr01  239  R
Ivanu                   vl990301gr01  249  R
Erdeljac                vl990301gr01  255  R
s                       vl990301gr01  264  R
pjesmom                 vl990301gr01  266  R
"                       vl990301gr01  275  I
Crazy                   vl990301gr01  276  R
"                       vl990301gr01  281  I
,                       vl990301gr01  282  I
druga                   vl990301gr01  284  R
je                      vl990301gr01  290  R
Antonija                vl990301gr01  293  R
Mikita                  vl990301gr01  302  R
s                       vl990301gr01  309  R
pjesmom                 vl990301gr01  311  R
  • tokenization
  • TOKENIZER: custom-made software
  • input: XML
  • output 1: tabbed file for database input
  • output 2: tokenized XML
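
The tabbed output above pairs each token with a document identifier, a character offset and a one-letter type. The following is a minimal sketch of how such rows could be produced; it is not the TOKENIZER program itself. The regular expression, the function name `tokenize` and the reading of the type codes (X for markup, R for words, B for digits, I for other characters) are assumptions inferred from the sample rows.

```python
import re

# One alternative per assumed token type; the group name doubles as the type code.
TOKEN_RE = re.compile(
    r"(?P<X><[^>]+>)"                  # X: markup tag
    r"|(?P<R>(?:[^\W\d_]|&#\d+;)+)"    # R: letters and numeric character references
    r"|(?P<B>\d+)"                     # B: run of digits
    r"|(?P<I>\S)"                      # I: any other non-space character
)


def tokenize(text, doc_id):
    """Return (token, doc_id, 1-based offset, type) rows for an XML string."""
    rows = []
    for match in TOKEN_RE.finditer(text):
        rows.append((match.group(), doc_id, match.start() + 1, match.lastgroup))
    return rows


if __name__ == "__main__":
    sample = ('<BODY><DIV0 type="article"><HEAD type="nn">'
              'U GORICI SVETOJANSKOJ ODR&#381;AN 12. FESTIVAL')
    for token, doc, offset, kind in tokenize(sample, "vl990301gr01"):
        print(f"{token}\t{doc}\t{offset}\t{kind}")
```

Run on the first heading of the sample article, this reproduces the offsets shown on the slide (1, 7, 28, 44, 46, ...), which suggests the offsets count characters of the XML source with character references left unexpanded.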

12
HNK 9: corpus format 4
  • output 2: tokenized XML

<BODY> <DIV0 type="article"> <HEAD type="nn">
<W type="R">U</W> <W type="R">GORICI</W> <W type="R">SVETOJANSKOJ</W>
<W type="R">ODRŽAN</W> <W type="B">12</W> <W type="I">.</W>
<W type="R">FESTIVAL</W> <W type="R">PJEVAČA</W> <W type="R">AMATERA</W>
</HEAD>
<HEAD type="na">
<W type="R">Ivana</W> <W type="R">osvojila</W>
<W type="R">županijski</W> <W type="R">Sanremo</W>
</HEAD>
<HEAD type="pn">
<W type="I"></W> <W type="R">Od</W> <W type="B">20</W>
<W type="R">natjecatelja</W> <W type="R">žiri</W> <W type="R">je</W>
<W type="R">najboljom</W> <W type="R">proglasio</W>
<W type="R">Ivanu</W> <W type="R">Erdeljac</W> <W type="R">s</W>
<W type="R">pjesmom</W> <W type="I">"</W> <W type="R">Crazy</W>
<W type="I">"</W> <W type="I">,</W> <W type="R">druga</W>
<W type="R">je</W> <W type="R">Antonija</W> <W type="R">Mikita</W>
<W type="R">s</W> <W type="R">pjesmom</W> <W type="I">"</W>
<W type="R">To</W> <W type="I">"</W> <W type="I">,</W>
<W type="R">a</W> <W type="R">treće</W> <W type="R">je</W>
<W type="R">mjesto</W> <W type="R">osvojila</W>
<W type="R">Ksenija</W> <W type="R">Cvetetic</W>
</HEAD>
<FIGURE>
<W type="R">Publici</W> <W type="R">su</W> <W type="R">se</W>
<W type="R">najviše</W> <W type="R">svidjeli</W>
<W type="R">Marija</W> <W type="R">alic</W> <W type="R">i</W>
<W type="R">Petar</W> <W type="R">Puhijera</W>
</FIGURE>
<P>
<W type="R">Pod</W> <W type="R">medijskim</W>
<W type="R">pokroviteljstvom</W> <W type="I">"</W>
<W type="R">Večernjeg</W> <W type="R">lista</W> <W type="I">"</W>
<W type="R">i</W> <W type="R">Radio</W> <W type="R">Jaske</W>
<W type="I">,</W> <W type="R">a</W> <W type="R">uz</W>
<W type="R">pomoć</W> <W type="R">DIR</W> <W type="I">"</W>
<W type="R">Rubinic</W> <W type="I">"</W> <W type="R">kao</W>
<W type="R">generalnog</W> <W type="R">te</W> <W type="R">još</W>
<W type="R">sedamdesetak</W> <W type="R">drugih</W>
<W type="R">sponzora</W> <W type="I">,</W> <W type="R">u</W>
<W type="R">petak</W> <W type="R">i</W> <W type="R">u</W>
<W type="R">subotu</W> <W type="R">u</W> <W type="R">Gorici</W>
<W type="R">Svetojanskoj</W> <W type="R">pokraj</W>

13
HNK 10: POS annotation 1
  • Croatian
  • morphologically rich language
  • nouns: 7 cases, 2 numbers, 3 genders
  • adjectives: 2 forms (definite/indefinite), 3
    grades of comparison
  • adverbs: 3 grades of comparison
  • pronouns: 7 cases, 2 numbers, 3 genders, 3
    persons
  • numbers: 7 cases, 3 genders
  • verbs
  • 2 numbers, 3 persons
  • 3 simple, 3 periphrastic tenses (with difference
    in 3 genders and 2 numbers in participles)
  • 2 additional participles
  • 2 conditionals
  • imperative
  • very complex system of aspects (perfective/
    imperfective/iterative)
  • a lot of syntactic relations coded by morphology
  • POS annotation and lemmatization more important
    than for e.g. English

14
HNK 11: POS annotation 2
word form      lemma        MSD
abeceda        =            Ncfsn
abecede        abeceda      Ncfsg
abecedi        abeceda      Ncfsd
abecedu        abeceda      Ncfsa
abecedo        abeceda      Ncfsv
abecedi        abeceda      Ncfsl
abecedom       abeceda      Ncfsi
abecede        abeceda      Ncfpn
abeceda        abeceda      Ncfpg
abecedama      abeceda      Ncfpd
abecede        abeceda      Ncfpa
abecede        abeceda      Ncfpv
abecedama      abeceda      Ncfpl
abecedama      abeceda      Ncfpi
abolicija      =            Ncfsn
abolicije      abolicija    Ncfsg
aboliciji      abolicija    Ncfsd
aboliciju      abolicija    Ncfsa
abolicijo      abolicija    Ncfsv
aboliciji      abolicija    Ncfsl
abolicijom     abolicija    Ncfsi
abolicije      abolicija    Ncfpn
abolicija      abolicija    Ncfpg
abolicijama    abolicija    Ncfpd
abolicije      abolicija    Ncfpa
abolicije      abolicija    Ncfpv
abolicijama    abolicija    Ncfpl
abolicijama    abolicija    Ncfpi
abrazija       =            Ncfsn
abrazije       abrazija     Ncfsg
abraziji       abrazija     Ncfsd
abraziju       abrazija     Ncfsa
abrazijo       abrazija     Ncfsv
abraziji       abrazija     Ncfsl
abrazijom      abrazija     Ncfsi
abrazije       abrazija     Ncfpn
abrazija       abrazija     Ncfpg
abrazijama     abrazija     Ncfpd
abrazije       abrazija     Ncfpa
abrazije       abrazija     Ncfpv
abrazijama     abrazija     Ncfpl
abrazijama     abrazija     Ncfpi
adaptacija     =            Ncfsn
adaptacije     adaptacija   Ncfsg
adaptaciji     adaptacija   Ncfsd
adaptaciju     adaptacija   Ncfsa
adaptacijo     adaptacija   Ncfsv
adaptaciji     adaptacija   Ncfsl
adaptacijom    adaptacija   Ncfsi
adaptacije     adaptacija   Ncfpn
adaptacija     adaptacija   Ncfpg
adaptacijama   adaptacija   Ncfpd
adaptacije     adaptacija   Ncfpa
adaptacije     adaptacija   Ncfpv
adaptacijama   adaptacija   Ncfpl
adaptacijama   adaptacija   Ncfpi
  • Croatian morphological lexicon
  • 36,000 headwords
  • GenOblik2 morphological generator, Tadić (1994)
  • MULTEXT-East MSD recommendation
  • 6 CEE languages
  • Croatian specification added in 1998
  • Erjavec: MULTEXT-East recommendation V 2.0?
  • matching with corpus
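
The MSD tags in the lexicon excerpt above are positional: each character after the category letter encodes one feature. The sketch below decodes the five-character noun tags that appear on the slide (for example `Ncfsn`). The feature table follows the usual MULTEXT-East reading of noun positions (type, gender, number, case) and is an assumption here; the function name `decode_noun_msd` is hypothetical, and tags for other categories or with further positions are not handled.

```python
# Position-by-position value tables for noun MSDs of the form N + 4 characters.
NOUN_FEATURES = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {
        "n": "nominative", "g": "genitive", "d": "dative", "a": "accusative",
        "v": "vocative", "l": "locative", "i": "instrumental",
    }),
]


def decode_noun_msd(msd):
    """Expand a five-character noun MSD such as 'Ncfsn' into named features."""
    if len(msd) != 5 or msd[0] != "N":
        raise ValueError(f"not a five-character noun MSD: {msd!r}")
    return {
        name: values[code]
        for (name, values), code in zip(NOUN_FEATURES, msd[1:])
    }


if __name__ == "__main__":
    for form, msd in [("abeceda", "Ncfsn"), ("abecedom", "Ncfsi"), ("abecedama", "Ncfpl")]:
        print(form, msd, decode_noun_msd(msd))
```

With 7 cases and 2 numbers, each noun lemma gets the 14 rows seen in the excerpt.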

15
HNK 12 POS annotation 3
16
HNK 13: POS annotation 4
  • automatically annotate a 1 Mw corpus
  • manual correction
  • use it as training data for a tagger
  • TnT

17
Parallel corpora
  • Croatian-English parallel corpus
  • Slovene-Croatian parallel corpus
  • Acquis translations corpus

18
HR-EN parallel corpus 1
  • source: Croatia Weekly
  • like USA Today: different domains
  • politics, economy and finance, tourism, ecology,
    culture, art, events, sports
  • 12 pages, A3
  • prepared in Croatian, then translated by a
    professional translation office
  • availability
  • 118 issues
  • started January 1998, finished May 2000
  • access to all texts in electronic form in both
    languages

19
HR-EN parallel corpus 2
  • Articles: 4,343
  • Sentences
  • HR: 67,694 (15.59 s/article avg.)
  • EN: 75,390 (17.36 s/article avg.)
  • Tokens
  • HR: 1,490,964 (22.03 w/s avg.)
  • EN: 1,796,744 (23.83 w/s avg.)
  • Total: 3,287,708
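
The averages on this slide can be recomputed from the raw counts. The short check below does that; the dictionary layout and the function name `averages` are illustrative assumptions.

```python
ARTICLES = 4343

# Raw counts as given on the slide.
STATS = {
    "HR": {"sentences": 67694, "tokens": 1490964},
    "EN": {"sentences": 75390, "tokens": 1796744},
}


def averages(lang):
    """Return (sentences per article, tokens per sentence), rounded to 2 places."""
    counts = STATS[lang]
    per_article = round(counts["sentences"] / ARTICLES, 2)
    per_sentence = round(counts["tokens"] / counts["sentences"], 2)
    return per_article, per_sentence


if __name__ == "__main__":
    for lang in STATS:
        print(lang, averages(lang))
    print("total tokens:", sum(c["tokens"] for c in STATS.values()))
```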

20
HR-EN parallel corpus 3
21
HR-EN parallel corpus 4
  • Sentence marking
  • </S><S> insertion after punctuation followed by a
    capital letter
  • filtered for known exceptions: Mr., Mrs., Miss.,
    dr., St. etc.
  • problem of ordinal numbers written with
    punctuation in Croatian orthography
  • Vanilla aligner
  • alignments
  • 0:1: 310 in 235 articles (0.45%)
  • 1:0: 25 in 12 articles (0.04%)
  • 1:1: 56,783 in 4,143 articles (84.12%)
  • 1:2: 8,611 in 3,288 articles (12.76%)
  • 2:1: 1,391 in 1,012 articles (2.06%)
  • 2:2: 379 in 345 articles (0.56%)
  • Total alignments: 67,499 in 4,143 articles
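
The sentence-marking rule described above (a boundary after punctuation followed by a capital letter, minus known exceptions) can be sketched as follows. This is an illustration under stated assumptions, not the software used for the corpus: the function name `mark_sentences`, the `skip_ordinals` switch and the sample sentences are hypothetical, and the exception list contains only the abbreviations named on the slide.

```python
import re

# Known exceptions listed on the slide; a real list would be longer.
ABBREVIATIONS = {"Mr.", "Mrs.", "Miss.", "dr.", "St."}

# Sentence-final punctuation, whitespace, then a capital letter (incl. Croatian ones).
BOUNDARY_RE = re.compile(r"([.!?])(\s+)(?=[A-ZČĆĐŠŽ])")


def mark_sentences(text, skip_ordinals=False):
    """Wrap text in <S>...</S> and insert </S><S> at detected boundaries."""
    pieces = []
    last = 0
    for match in BOUNDARY_RE.finditer(text):
        # The whitespace-delimited token that ends with the punctuation mark.
        previous_token = text[:match.end(1)].split()[-1]
        if previous_token in ABBREVIATIONS:
            continue
        # Croatian writes ordinals with a full stop ("12."), which looks like a boundary.
        if skip_ordinals and re.fullmatch(r"\d+\.", previous_token):
            continue
        pieces.append(text[last:match.end(1)])
        pieces.append("</S><S>")
        last = match.end()
    pieces.append(text[last:])
    return "<S>" + "".join(pieces) + "</S>"


if __name__ == "__main__":
    print(mark_sentences("Mr. Smith came. He left."))
    text = "Održan je 12. Festival amatera. Ivana je pobijedila."
    print(mark_sentences(text))
    print(mark_sentences(text, skip_ordinals=True))
```

Neither setting of `skip_ordinals` is safe in general: without it, an ordinal before a capitalized word splits a sentence in two; with it, a sentence that really ends in a number is merged with the next one. That is the ordinal-number problem the slide refers to.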

22
HR-EN parallel corpus 5
  • encoding problem: how to store alignments?
  • Tadić (2000) LREC2000
  • (X)CES way
  • each language in a separate document
  • <S id="...">
  • pointers to IDs of aligned sentences in a 3rd
    document
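
A minimal sketch of the three-document arrangement described above: the two language documents carry sentence IDs, and a third document holds only pointers to them. The element names `linkGrp` and `link`, the `targType` attribute, the ` ; ` separator between the two sides and the sentence IDs are assumptions made for illustration; only the `xtargets` attribute appears elsewhere in this presentation.

```python
import xml.etree.ElementTree as ET


def build_alignment_doc(pairs):
    """Serialize sentence alignments as a stand-alone XML string.

    `pairs` is a list of (hr_ids, en_ids) tuples; either side may hold
    zero, one or two sentence IDs, matching 0:1 ... 2:2 alignments.
    """
    root = ET.Element("linkGrp", {"targType": "s"})
    for hr_ids, en_ids in pairs:
        targets = " ".join(hr_ids) + " ; " + " ".join(en_ids)
        ET.SubElement(root, "link", {"xtargets": targets})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical sentence IDs: one 1:1 and one 1:2 alignment.
    alignments = [(["hr.1"], ["en.1"]), (["hr.2"], ["en.2", "en.3"])]
    print(build_alignment_doc(alignments))
```

Because the alignment lives in its own file, the two corpus documents stay unchanged when the alignment is corrected or recomputed.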

23
HR-EN parallel corpus 6
24
Acquis translations parallel corpus
  • Croatia is on the way to becoming a candidate
    country for the EU
  • translation of the AC: the only task equal for all
    candidate countries
  • translating 200,000 pages of the EU OJ into Croatian
    (ca. 60 Mw)
  • translating 100,000 pages of Croatian legislation
    into English/French...
  • Ministry of European Integration of the Republic
    of Croatia
  • organizing the translation process
  • 200 freelance translators or translation
    companies
  • existing on-line lexical databases (CELEX...): no
    Croatian terms and/or TE
  • how to maintain the consistency of translations?
  • EuroVoc translated into Croatian
  • thesaurus of European Commission terms
  • Institute of linguistics
  • proposal for a joint project of preparing AC
    texts for translation
  • marking of terms found in EuroVoc and TE suggestion

25
AC translations parallel corpus 3
26
AC translations parallel corpus 4
27
AC translations parallel corpus 5
28
AC translations parallel corpus 6
29
AC translations parallel corpus 7
  • if we insert <S> and </S> tags and give them
    ID attributes in both original and translation, we
    can use the whole of the AC as a huge translation
    memory
  • parallel corpus aligned at the <S> level = TM
  • just a matter of encoding
  • alignment and/or <TU> marking
  • term marking
  • <W>-level marking needed
  • several encoding solutions

30
AC translations parallel corpus 8
  • solution 1: term tags intermixed with corpus data

<P>
<S>
<W id="845">The</W>
<term><W id="846">European</W>
<W id="847">Parliament</W></term>
<W id="848">may</W>
<W id="849">ask</W>...
</S>...
</P>...

  • problem: non-contiguous multi-W terminological
    units

31
AC translations parallel corpus 9
  • solution 2: term marking in stand-off annotation,
    i.e. in another XML document linked to corpus data

<P>
<S>
<W id="845">The</W>            <W id="765">Europski</W>
<W id="846">European</W>       <W id="766">bi</W>
<W id="847">Parliament</W>     <W id="767">parlament</W>
<W id="848">may</W>            <W id="768">mogao</W>
<W id="849">ask</W>...         <W id="769">tražiti</W>
</S>...
</P>...
<term_unit id="en122">         <term_unit id="hr345">
<link xtargets="846 847"/>     <link xtargets="765 767"/>
</term_unit>                   </term_unit>

  • allows marking of non-contiguous terms
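
To illustrate why stand-off marking copes with non-contiguous terms, the sketch below resolves a `term_unit` against the Croatian word IDs from the example and reports whether the referenced words are adjacent. The wrapper element `standoff` and the function name `resolve_terms` are assumptions; the IDs and the `term_unit`/`link`/`xtargets` names are taken from the slide.

```python
import xml.etree.ElementTree as ET

# Croatian sentence from the slide as (word id, word form) in text order.
HR_SENTENCE = [
    ("765", "Europski"), ("766", "bi"), ("767", "parlament"),
    ("768", "mogao"), ("769", "tražiti"),
]

# Stand-off document; the <standoff> wrapper is a hypothetical root element.
STANDOFF = (
    '<standoff>'
    '<term_unit id="hr345"><link xtargets="765 767"/></term_unit>'
    '</standoff>'
)


def resolve_terms(standoff_xml, sentence):
    """Map each term_unit id to (term text, is_contiguous)."""
    words = dict(sentence)
    position = {word_id: index for index, (word_id, _) in enumerate(sentence)}
    resolved = {}
    for unit in ET.fromstring(standoff_xml).iter("term_unit"):
        ids = unit.find("link").get("xtargets").split()
        indexes = [position[word_id] for word_id in ids]
        contiguous = indexes == list(range(indexes[0], indexes[0] + len(indexes)))
        resolved[unit.get("id")] = (" ".join(words[i] for i in ids), contiguous)
    return resolved


if __name__ == "__main__":
    print(resolve_terms(STANDOFF, HR_SENTENCE))
```

Here the term "Europski parlament" is interrupted by the clitic "bi" (word 766), so a single inline `<term>` element could not wrap it without also wrapping the clitic.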

32
AC translations parallel corpus 10
  • solution 3: term marking with translation
    equivalent suggestion

<P>
<S>
<W id="845">The</W>            <W id="765">Europski</W>
<W id="846">European</W>       <W id="766">bi</W>
<W id="847">Parliament</W>     <W id="767">parlament</W>
<W id="848">may</W>            <W id="768">mogao</W>
<W id="849">ask</W>...         <W id="769">tražiti</W>
</S>...
</P>...
<term_unit id="en122">         <term_unit id="hr345">
<link xtargets="846 847"/>     <link xtargets="765 767"/>
</term_unit>                   </term_unit>
<tu><link xtargets="en122 hr345"/></tu>
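
A sketch of how the `<tu>` links above could be followed to produce translation-equivalent suggestions: each `<tu>` points at two term units, and each term unit points at word IDs. The wrapper element `annotation`, the merged word table and the function name `suggest_equivalents` are assumptions for illustration; IDs and element names come from the slide.

```python
import xml.etree.ElementTree as ET

# Word forms by ID, merged from the English and Croatian corpus documents.
WORDS = {
    "845": "The", "846": "European", "847": "Parliament", "848": "may", "849": "ask",
    "765": "Europski", "766": "bi", "767": "parlament", "768": "mogao", "769": "tražiti",
}

# Stand-off annotation; the <annotation> wrapper is a hypothetical root element.
STANDOFF = """
<annotation>
  <term_unit id="en122"><link xtargets="846 847"/></term_unit>
  <term_unit id="hr345"><link xtargets="765 767"/></term_unit>
  <tu><link xtargets="en122 hr345"/></tu>
</annotation>
"""


def suggest_equivalents(standoff_xml, words):
    """Return (source term, target term) pairs for every <tu> element."""
    root = ET.fromstring(standoff_xml)
    terms = {}
    for unit in root.iter("term_unit"):
        ids = unit.find("link").get("xtargets").split()
        terms[unit.get("id")] = " ".join(words[word_id] for word_id in ids)
    pairs = []
    for tu in root.iter("tu"):
        source_id, target_id = tu.find("link").get("xtargets").split()
        pairs.append((terms[source_id], terms[target_id]))
    return pairs


if __name__ == "__main__":
    print(suggest_equivalents(STANDOFF, WORDS))
```

The two levels of indirection keep the corpus documents untouched: term units and translation units can be added or revised without editing the `<W>` elements they refer to.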

33
AC translations parallel corpus 11
  • XLink
  • W3C Working Draft, 2000-02-21
    (http://www.w3.org/TR/xlink)
  • XML's powerful linking tool
  • allows stand-off annotation (Ide et al. 2000)
  • no changes in corpus data: annotation of
    read-only data
  • multimodal corpora annotation
  • time-line links
  • links of language data with audio or video
    (paralinguistic data)
  • systems using XLink intensively
  • MATE workbench (McKelvie et al. 2000)
  • LDC (Bird & Liberman 2000)
  • ...

34
Some methodological remarks 1
  • some skepticism
  • what exactly do we do by putting annotations in
    corpora?
  • adding secondary data to our primary data in
    order to be able to retrieve information later
  • adding categories selected from a prepared list
    and applying them to our corpus data
  • not concerned here with meta-description (usually
    in headers)
  • secondary data = result of interpretation of
    primary data
  • by adding already prepared categories
  • we get a lot of information which could not be
    collected any other way
  • could we miss some phenomena which we haven't
    foreseen at the stage of category preparation?

35
Some methodological remarks 2
  • example on the very basic level of word boundary
  • nikoji, zam. pridj. nijedan, nikakav
    (pronominal adjective: 'none, not any')
  • (Anić, Vladimir: Rječnik hrvatskoga jezika, 1991)
  • Ni u kojem se slučaju ne smije okrenuti!
    ('Under no circumstances may he/she turn around!')
  • oligo- and poly-saccharides...
  • Ivan je ikic radosno krenuo nizbrdo.
  • How many words do we have here?
  • Is it a trivial question?
  • opposition between graphic words and lemmas
  • not to mention syntax and/or semantics

36
Some methodological remarks 3
  • putting only one kind of secondary/interpretive
    data in a corpus
  • filtering only those linguistic phenomena which
    we are able to grasp with our already prepared
    categories
  • missing phenomena for which we are not prepared
  • keeping our secondary/tertiary/... data apart
    from basic resource data
  • allows other researchers to have their own
    secondary etc. data and different interpretations
  • allows us to compare different interpretive data
    interpersonally and/or automatically
  • XML and the concept of stand-off annotation give us
    a tool for that

37
Encoding Croatian Corpora
  • Marko Tadić (marko.tadic_at_ffzg.hr,
    www.hnk.ffzg.hr/mt)
  • Department of Linguistics / Institute of
    Linguistics, Faculty of Philosophy, University of
    Zagreb (www.ffzg.hr, www.ffzg.hr/zzl/zzl-home.htm)
  • Tübingen, 2001-02-22