Next Steps Technical Details - PowerPoint PPT Presentation

1
Next Steps / Technical Details
Bruno Pouliquen, Ralf Steinberger
Addressing the Language Barrier Problem in the Enlarged EU
Automatic Eurovoc Descriptor Assignment
JRC Workshop, Ispra, 16/17 September 2004
http://www.jrc.cec.eu.int/langtech
2
Eurovoc indexing: Extend language coverage
  • Czech
  • Croatian
  • Latvian
  • Lithuanian
  • Polish
  • Slovak
  • Analysis:
      • Danish
      • Dutch
      • English
      • Finnish
      • French
      • German
      • (Greek)
      • Italian
      • Portuguese
      • Spanish
      • Swedish
      • (Lithuanian)
      • (Bulgarian)
      • (Hungarian)
  • Soon also:
      • Albanian
      • Romanian
      • Russian
      • Slovene

3
Incentive for collaboration
  • Mutual benefit:
      • We can provide tools and results to you (to non-commercial
        Member State organisations)
      • JRC will be able to Eurovoc-index documents for news analysis, etc.
  • No payments by the JRC are foreseen
  • How to go ahead? / What to do next?
      • We need Eurovoc-indexed texts in your languages (or translations
        of Eurovoc-indexed texts!) (Acquis Communautaire)

4
Format to provide training texts to the JRC
  • Ideally:
      • Plain text (not MS-Word, RTF, PDF, etc.)
      • UTF-8 character encoding
      • With CELEX code
      • With Eurovoc descriptor code (mentioning Eurovoc version)
      • XML format, structured
  • Linguistically pre-processed and structured:
      • lemmatised
      • annexes / signatures separate
      • title separate
      • stop word lists
  • MANY texts:
      • 80,000 English texts were enough to train ca. 3500 descriptors
        (out of 6000)!
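
The requirements above could be met with a layout along the following lines. This is only a minimal sketch: the slides do not specify a schema, so the element and attribute names (document, celex, eurovoc_version, descriptors, title, text, annex), the CELEX placeholder and the sample title are all assumptions made for illustration. The descriptor code is the one shown for PRINCIPLE OF SUBSIDIARITY in the assignment output.

```python
import xml.etree.ElementTree as ET


def build_training_doc(celex, eurovoc_version, descriptor_ids, title, body, annex=""):
    """Wrap one text in a hypothetical XML layout (all names are assumptions)."""
    doc = ET.Element("document", {"celex": celex, "eurovoc_version": eurovoc_version})
    descriptors = ET.SubElement(doc, "descriptors")
    for descriptor_id in descriptor_ids:
        ET.SubElement(descriptors, "descriptor", {"id": descriptor_id})
    # Title, body and annex are kept in separate elements, as requested on the slide.
    ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "text").text = body
    if annex:
        ET.SubElement(doc, "annex").text = annex
    return doc


doc = build_training_doc(
    celex="CELEX-PLACEHOLDER",            # placeholder, not a real CELEX code
    eurovoc_version="4",
    descriptor_ids=["1011020102000000"],  # code taken from the assignment output
    title="Przykładowy tytuł",            # illustrative sample text
    body="...",
)

# Serialise as UTF-8 bytes; non-ASCII characters are written as UTF-8.
xml_bytes = ET.tostring(doc, encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```

Keeping the Eurovoc version as an attribute of each document makes it possible to mix texts indexed with different thesaurus versions in one file.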

5
Descriptor distribution in Spanish EP/EC texts
6
Descriptor distribution in Spanish EP/EC texts
7
Descriptor distribution in Spanish Congress texts
8
Descriptor distribution in Hungarian texts
9
Procedure
  • You provide us with:
      • A big XML file containing the documents
      • A stop word list
  • We will give back to you:
      • A subset of documents (evaluation set)
          • Same format
          • Additional information on automatic Eurovoc descriptors assigned
      • Some statistics on descriptor usage frequency, etc.
      • An online browser interface to see the assignment results
      • A validation interface

10

<xml> <assignment> Eurovoc Assignment </assignment> </xml>
export
11
XML format
12
(No Transcript)
13
(No Transcript)
14
Results of descriptor assignment - interface
15
Results of descriptor assignment - XML
<assignment>
  <descriptor ID="1006020102000000" COSINE="0.20" OKAPI="8.83">PRESIDENCY OF THE EC COUNCIL</descriptor>
  <descriptor ID="1016030000000000" COSINE="0.17" OKAPI="9.08">EUROPEAN UNION</descriptor>
  <descriptor ID="1006040100000000" COSINE="0.15" OKAPI="9.63">PRESIDENT</descriptor>
  <descriptor ID="2826020000000000" COSINE="0.14" OKAPI="7.82">SOCIAL POLICY</descriptor>
  <descriptor ID="1011020102000000" COSINE="0.14" OKAPI="8.22">PRINCIPLE OF SUBSIDIARITY</descriptor>
  ...
</assignment>
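
A partner receiving this output could read it with a few lines of code. The following is a minimal sketch, assuming the assignment element is well-formed XML with the ID, COSINE and OKAPI attributes shown on the slide (with the angle brackets and equals signs lost in the transcript restored). The function name read_assignment is hypothetical, and sorting by the OKAPI score is just one possible choice.

```python
import xml.etree.ElementTree as ET

# The five descriptors shown on the slide (the elided "..." entries are omitted).
ASSIGNMENT_XML = """<assignment>
  <descriptor ID="1006020102000000" COSINE="0.20" OKAPI="8.83">PRESIDENCY OF THE EC COUNCIL</descriptor>
  <descriptor ID="1016030000000000" COSINE="0.17" OKAPI="9.08">EUROPEAN UNION</descriptor>
  <descriptor ID="1006040100000000" COSINE="0.15" OKAPI="9.63">PRESIDENT</descriptor>
  <descriptor ID="2826020000000000" COSINE="0.14" OKAPI="7.82">SOCIAL POLICY</descriptor>
  <descriptor ID="1011020102000000" COSINE="0.14" OKAPI="8.22">PRINCIPLE OF SUBSIDIARITY</descriptor>
</assignment>"""


def read_assignment(xml_text):
    """Return one dict per assigned descriptor, with both scores as floats."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": d.get("ID"),
            "name": (d.text or "").strip(),
            "cosine": float(d.get("COSINE")),
            "okapi": float(d.get("OKAPI")),
        }
        for d in root.iter("descriptor")
    ]


rows = read_assignment(ASSIGNMENT_XML)
# Rank by OKAPI score, highest first; sorting by "cosine" works the same way.
by_okapi = sorted(rows, key=lambda r: r["okapi"], reverse=True)
for r in by_okapi:
    print(f'{r["okapi"]:5.2f}  {r["cosine"]:.2f}  {r["id"]}  {r["name"]}')
```

Note that the two scores rank the descriptors differently: PRESIDENCY OF THE EC COUNCIL has the highest cosine value, while PRESIDENT has the highest OKAPI value.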
16
Results of descriptor assignment - validation
Numeric feedback?
17
Arranging the collaboration of scientific partners
  • The JRC will be able to provide the tool and indexing results.
  • The JRC does not have specific funds to pay for this work.
  • Possibilities for collaboration between parliament and scientists:
      • informal collaboration without payment
      • formal collaboration (contract, payment)
      • apply for a project with national or EU funding (example: Hungary)
      • M.Sc. theses (e.g. Lithuanian), internships (e.g. Estonian), ...
  • We would like to have lemmatisers for the new languages.
  • If necessary, we can train the system without linguistic
    pre-processing.

18
Pre-processing of the texts (by scientists?)
  • Linguistic pre-processing, needed for each language:
      • General and corpus-specific list of stop words (several thousand!)
      • For highly inflected languages: some lemmatiser or stemmer
      • Multi-word term mark-up for disambiguation purposes?
  • Further text processing:
      • Some document structuring to separate title, text, footer and annex
      • Conversion to XML
      • Conversion to UTF-8
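
Some of these steps can be illustrated in a few lines. The sketch below is not the JRC pipeline: it assumes the source text arrives in a known legacy encoding (ISO-8859-2 in the example), that the title is the first line of the file, and that tokens are plain runs of word characters. The function names, the two-word stop list and the sample sentence are hypothetical, and lemmatisation is left out because it needs a language-specific tool.

```python
import re


def to_utf8(raw, source_encoding):
    """Re-encode bytes from a known legacy encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def split_title(text):
    """Assume the first line is the title and the rest is the body."""
    title, _, body = text.partition("\n")
    return title.strip(), body.strip()


def remove_stop_words(text, stop_words):
    """Lower-case, tokenise on word characters and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]


# Illustrative Polish sample, stored in ISO-8859-2 as a legacy file might be.
sample = "Rozporządzenie Rady\nRada przyjęła to rozporządzenie w sprawie rynku."
raw_latin2 = sample.encode("iso-8859-2")

utf8_bytes = to_utf8(raw_latin2, "iso-8859-2")
title, body = split_title(utf8_bytes.decode("utf-8"))
stop_words = {"to", "w"}  # a real list would have several thousand entries
content_words = remove_stop_words(body, stop_words)
print(title)
print(content_words)
```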

19
Dealing with different versions of Eurovoc
  • Problem has not yet been solved: request for your input
  • English training material was indexed with versions 3.1 and 4
  • Challenge: new descriptors need new training material → delay
  • Re-training required

20
Dealing with different versions of Eurovoc (2)
  • Case 1: New descriptor
      • Search old and new documents for related documents for re-training
  • Case 2: New name for old descriptor
      • Replace the descriptor name: OLD_NAME → NEW_NAME
  • Case 3: New place in hierarchy
      • No problem
  • Case 4: Disappearing descriptor
      • Will no longer be assigned

21
Dealing with different versions of Eurovoc (2)
  • Case 5: Several descriptors are conflated
      • No problem
  • Case 6: A descriptor is split into two or more
      • Re-training required (see Case 1)

OLD_NAME → NEW_NAME_1, NEW_NAME_2, NEW_NAME_3
22
Dealing with different versions of Eurovoc (3)
  • Changes between Eurovoc versions should not only be described in
    free text.
  • They should be formalised in a machine-readable way (e.g. in XML,
    in table format, ...).
  • This should be done centrally for the thesaurus (i.e. for all
    thesaurus languages), rather than separately for each language!
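
To make the proposal concrete, the six cases from the preceding slides could be encoded in a small change log and applied automatically. The following is only a sketch under stated assumptions: no such format exists in the source, so the element names (eurovoc_changes, new, renamed, moved, removed, merged, split), the attribute names and the descriptor IDs D1 to D9 are invented for illustration. For a split descriptor the sketch only flags the new descriptors for re-training and leaves existing labels untouched.

```python
import xml.etree.ElementTree as ET

# Hypothetical change log; the format and all IDs are assumptions.
CHANGES_XML = """<eurovoc_changes from="3.1" to="4">
  <new id="D_NEW"/>
  <renamed id="D1" old_name="OLD_NAME" new_name="NEW_NAME"/>
  <moved id="D2"/>
  <removed id="D3"/>
  <merged into="D4"><source id="D5"/><source id="D6"/></merged>
  <split id="D7"><target id="D8"/><target id="D9"/></split>
</eurovoc_changes>"""


def read_changes(xml_text):
    """Turn the change log into a plan keyed by the action needed."""
    plan = {"rename": {}, "merge": {}, "drop": set(), "retrain": set()}
    for el in ET.fromstring(xml_text):
        if el.tag == "new":          # Case 1: needs new training material
            plan["retrain"].add(el.get("id"))
        elif el.tag == "renamed":    # Case 2: replace the descriptor name
            plan["rename"][el.get("old_name")] = el.get("new_name")
        elif el.tag == "moved":      # Case 3: new place in hierarchy, no action
            continue
        elif el.tag == "removed":    # Case 4: will no longer be assigned
            plan["drop"].add(el.get("id"))
        elif el.tag == "merged":     # Case 5: map old IDs to the merged one
            for src in el.findall("source"):
                plan["merge"][src.get("id")] = el.get("into")
        elif el.tag == "split":      # Case 6: re-train the new descriptors
            plan["retrain"].update(t.get("id") for t in el.findall("target"))
    return plan


def migrate_labels(descriptor_ids, plan):
    """Update the descriptor IDs attached to one training document."""
    migrated = set()
    for d in descriptor_ids:
        d = plan["merge"].get(d, d)
        if d not in plan["drop"]:
            migrated.add(d)
    return migrated


plan = read_changes(CHANGES_XML)
print(sorted(plan["retrain"]))
print(sorted(migrate_labels({"D1", "D3", "D5", "D6"}, plan)))
```

Because the change log refers to descriptor IDs rather than names, a single file of this kind would serve all thesaurus languages, which is the point made on the slide.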

23
Appeal to Eurovoc community / EP / OPOCE
  • Make Eurovoc available to the wider public in machine-readable form
  • Formalise the version differences (e.g. XML)
  • Make Eurovoc-indexed texts available to the scientific community
      • Controlled by licences, if necessary
      • E.g. via the Evaluations and Language resources Distribution
        Agency (ELDA)
      • See http://www.elda.fr
      • ELDA handles the practical and legal issues related to the
        distribution of language resources, provides legal advice in the
        field of HLT, and drafts and concludes distribution agreements on
        behalf of ELRA.
  • Wealth of parallel texts to train multilingual text analysis
    applications:
      • Machine Translation
      • Multilingual Named Entity Recognition
      • Multilingual classification
      • Multi-document summarisation
      • Automatic indexing
  • The benefit is yours!

24
(No Transcript)