Computer Aided Document Indexing System for Accessing Legislation: A Joint Venture of Flanders and Croatia

1
Computer Aided Document Indexing System for Accessing Legislation
A Joint Venture of Flanders and Croatia
Bojana Dalbelo Bašić
Faculty of Electrical Engineering and Computing, University of Zagreb
bojana.dalbelo_at_fer.hr
Marko Tadić
Faculty of Humanities and Social Sciences, University of Zagreb
marko.tadic_at_ffzg.hr
Marie-Francine Moens
Centre for Law and IT / Dept. of Computer Science, Katholieke Universiteit Leuven
marie-francine.moens_at_law.kuleuven.ac.be
2
Talk overview
  • document indexing and computer aided document
    indexing
  • project AIDE
  • CADIS workstation features
  • project CADIAL
  • eCADIS workstation additional features
  • machine learning techniques
  • future developments
  • conclusions

3
Computer Aided Document Indexing
  • document indexing
  • attachment of descriptors from a controlled
    thesaurus to a document
  • descriptors: labels representing the content of
    a document
  • necessary for document retrieval in many document
    collections
  • parliamentary documentation
  • legislation
  • technical documentation
  • usually done manually
  • tedious, error-prone, slow (max. 30-40
    documents/day)
  • could computers help in this process?
  • yes, if we build a Computer Aided Document
    Indexing System (CADIS)

4
Project AIDE in Croatia
  • idea for a project
  • September 2004
  • interdisciplinary collaboration of 3 institutions
  • Croatian Information Documentation Referral
    Agency (HIDRA)
  • Department of Electronics, Microelectronics,
    Computer and Intelligent Systems (ZEMRIS), Faculty
    of Electrical Engineering and Computing, University
    of Zagreb
  • Institute of Linguistics (ZZL), Faculty of
    Humanities and Social Sciences, University of
    Zagreb

5
AIDE collaborating institutions
  • HIDRA
  • collecting, processing and promoting the official
    documentation of the Republic of Croatia, and
    providing public access to it
  • coordinator: Maja Cvitaš, M.A.
  • ZEMRIS
  • research in the field of artificial intelligence,
    neural networks, machine learning, data and text
    mining
  • coordinators: prof. Bojana Dalbelo Bašić and Jan
    Šnajder, M.Sc.
  • ZZL
  • computational linguistic research and building
    language technologies for Croatian
  • coordinator: prof. Marko Tadić

6
AIDE project objective
  • Development of an intelligent system for automatic
    indexing of the official documentation of the
    Republic of Croatia with descriptors from the
    Eurovoc thesaurus

7
AIDE: how?
  • AIDE: Automatic Indexing of Documents with
    Eurovoc
  • automatic indexing, how?
  • a program which learns to index documents
  • conference at the Joint Research Centre of the EC
    (JRC), Ispra, Italy, 2004-09
  • at least 10,000 manually indexed documents
  • 3-5 descriptors per document
  • 10-15 documents per descriptor
  • indexed documents stored in XML format
  • Steinberger (2003)
  • compiling a corpus of manually indexed Croatian
    documents for machine learning of automatic
    indexing with Eurovoc descriptors
  • situation with Croatian documentation in 2004-09
  • only a few hundred documents had been indexed
  • manual indexing was painfully slow
  • how could we speed up the manual indexing?
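
The corpus requirements listed above (number of documents, descriptors per document, documents per descriptor) can be checked mechanically once the manual index is available. The following is a minimal sketch, assuming the index has already been loaded into a plain dictionary; the function names, the toy data and the reading of "10-15 documents per descriptor" as a lower bound are illustrative assumptions, not part of the AIDE tooling.

```python
from collections import Counter

def corpus_stats(index):
    """Summarise a manual index given as {document id: set of descriptors}."""
    docs_per_descriptor = Counter(d for descs in index.values() for d in descs)
    n_docs = len(index)
    avg_desc = sum(len(d) for d in index.values()) / n_docs if n_docs else 0.0
    return {
        "documents": n_docs,
        "avg_descriptors_per_document": avg_desc,
        "docs_per_descriptor": docs_per_descriptor,
    }

def undertrained_descriptors(index, min_docs=10):
    """Descriptors attached to fewer than min_docs documents (assumed threshold)."""
    counts = corpus_stats(index)["docs_per_descriptor"]
    return sorted(d for d, n in counts.items() if n < min_docs)

# Hypothetical toy index; a real corpus would hold thousands of documents.
toy_index = {
    "doc1": {"fisheries", "EU law"},
    "doc2": {"fisheries"},
    "doc3": {"taxation", "EU law"},
}
stats = corpus_stats(toy_index)
rare = undertrained_descriptors(toy_index, min_docs=2)
```

Descriptors reported by `undertrained_descriptors` are the ones for which a learned classifier would have too few positive examples.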

8
AIDE activities
  • investigate and develop algorithms in the field
    of computational linguistics/language
    technologies
  • include that knowledge into the Computer Aided
    Document Indexing System (CADIS)
  • demonstration of CADIS in the European Parliament
    (2006-03-10)

9
CADIS: two parallel windows
Eurovoc browser window
Document window
10
Document Window
12
CADIS features
  • Enhanced user interface
  • list of descriptors literally appearing in document
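
Listing the descriptors that appear literally in a document amounts to a whole-word, case-insensitive search of each thesaurus label in the text. The sketch below shows one way this could be done; the function name, the sample sentence and the three labels are hypothetical and are not taken from Eurovoc or from the CADIS implementation.

```python
import re

def literal_descriptors(text, descriptors):
    """Return the descriptors that occur verbatim (whole words, any case) in text."""
    found = []
    for label in descriptors:
        # Lookarounds keep "quota" from matching inside "quotas".
        pattern = r"(?<!\w)" + re.escape(label) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(label)
    return found

# Hypothetical document text and thesaurus labels.
text = "The act regulates fishing quotas and customs duties in coastal waters."
thesaurus = ["fishing quota", "customs duties", "taxation"]
hits = literal_descriptors(text, thesaurus)
```

Here only "customs duties" is found: "fishing quota" is missed because the text uses the plural form. Gaps of this kind are what lemmatisation is meant to close, which matters even more for a highly inflected language such as Croatian.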

13
CADIS features
  • Descriptors and non-descriptors marked in document

14
CADIS features
  • Lists of n-grams

15
CADIS features
  • Integration of corpus analysis
  • greyed n-grams are statistically relevant in the
    corpus, i.e., collocations
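
The slides do not say which statistic is used to decide that an n-gram is relevant in the corpus. As an illustration only, the sketch below scores bigrams with pointwise mutual information (PMI), one common collocation measure; the function names, the frequency cut-off and the toy token stream are assumptions.

```python
import math
from collections import Counter

def bigrams(tokens):
    """All adjacent token pairs, in order."""
    return list(zip(tokens, tokens[1:]))

def pmi_scores(corpus_tokens, min_count=2):
    """PMI for every bigram seen at least min_count times in the corpus."""
    unigram = Counter(corpus_tokens)
    bigram = Counter(bigrams(corpus_tokens))
    n_uni = sum(unigram.values())
    n_bi = sum(bigram.values())
    scores = {}
    for (w1, w2), count in bigram.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_pair = count / n_bi
        p_w1 = unigram[w1] / n_uni
        p_w2 = unigram[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Hypothetical, already tokenised mini-corpus.
tokens = ("official gazette publishes the act . the official gazette "
          "lists the decree . the act cites the decree").split()
scores = pmi_scores(tokens)
best = max(scores, key=scores.get)
```

In this toy corpus the pair ("official", "gazette") scores highest, because both words occur almost only together, while pairs starting with the frequent word "the" score lower.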

16
CADIS features
  • Manual marking of significant n-grams
  • an important step towards further refinement of
    automatic indexing

17
Eurovoc browser window
18
AIDE activities
  • investigate and develop algorithms in the field
    of computational linguistics/language
    technologies
  • include that knowledge into the Computer Aided
    Document Indexing System (CADIS)
  • demonstration of CADIS in the European Parliament
    (2006-03-10)
  • ca. 10,000 Croatian documents indexed at HIDRA
    using the CADIS workstation during 2006
  • joint project proposal with Katholieke
    Universiteit Leuven for the CADIAL project

19
CADIAL project
  • Computer Aided Document Indexing for Accessing
    Legislation
  • a joint Flemish-Croatian project
  • Department International Flanders, grant no.
    KRO/009/06
  • partners:
  • Katholieke Universiteit Leuven (prof.
    Marie-Francine Moens)
  • University of Zagreb, HIDRA (prof. Bojana Dalbelo
    Bašić)
  • started 2007-03
  • duration: 2 years
  • web: www.cadial.org
  • the goal: a publicly accessible service for
    automatic indexing of the official documentation
    of the Republic of Croatia
  • a new version of CADIS (eCADIS) is one of the
    modules in this project
  • planned as a web-based service

20
CADIAL project (2)
  • used the 10,000 manually indexed documents to
    train the system for automatic indexing of
    documents in Croatian
  • used the 20,000 manually indexed documents from
    Acquis to train the system for automatic indexing
    of documents in English
  • included that training data in the next
    version, eCADIS (β-version)

21
eCADIS (β) features
  • Automatic suggestion of relevant descriptors,
    i.e., automatic indexing
  • application of machine learning techniques

22
eCADIS (β) features
  • Compare it to manually attached indexes
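
Comparing automatic suggestions with the manually attached descriptors for one document can be expressed with set operations and the usual precision and recall figures. This is a minimal sketch under that assumption; the function name and the example descriptors are hypothetical, and the slides do not state which measures eCADIS reports.

```python
def compare_indexing(suggested, manual):
    """Compare suggested descriptors with the manual ones for a single document."""
    suggested, manual = set(suggested), set(manual)
    agreed = suggested & manual
    precision = len(agreed) / len(suggested) if suggested else 0.0
    recall = len(agreed) / len(manual) if manual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "agreed": sorted(agreed),
        "spurious": sorted(suggested - manual),  # suggested, not chosen manually
        "missed": sorted(manual - suggested),    # chosen manually, not suggested
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical descriptors for one document.
report = compare_indexing(
    suggested=["fisheries", "EU law", "taxation", "customs"],
    manual=["fisheries", "EU law", "maritime law"],
)
```

The "spurious" list corresponds to suggestions an indexer might mark as inappropriate, and the "missed" list to descriptors the system failed to propose.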

23
eCADIS (β) features
  • Manual marking of inappropriate suggestions
  • another step in further refinement of automatic
    indexing

24
eCADIS (β) on a document in English
25
eCADIS (β) on a document in English
  • Automatic suggestion of relevant descriptors,
    i.e., automatic indexing

26
eCADIS (β) on a document in English
  • Compare it to manually attached indexes

27
Training the classifiers
  • already existing classifiers
  • profile classifier (Steinberger 2003)
  • k-nearest neighbours
  • binary classifiers
  • SVM, Logistic Regression, Rocchio, Bayes, ...
  • classifiers used for the preliminary training
  • ca. 3,500 independent binary classifiers
  • need to be further evaluated
  • Logistic Regression used for 10,000 documents in
    Croatian
  • SVM used for 20,000 documents in English
  • features
  • tokens, lemmas, stems, character n-grams
  • various feature selection methods and their
    combinations: χ² (chi-square), IG (information
    gain), MI (mutual information)
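
With one independent binary classifier per descriptor, feature selection can be run separately for each descriptor. The sketch below scores terms with the chi-square statistic, a widely used feature selection measure for text classification, computed from the 2x2 table of term presence against descriptor membership. The function names and the four-document corpus are hypothetical; this is not the eCADIS training code.

```python
def chi_square(docs, labels, term, descriptor):
    """Chi-square of term vs. descriptor over documents given as token sets."""
    a = b = c = d = 0
    for tokens, descs in zip(docs, labels):
        has_term = term in tokens
        in_class = descriptor in descs
        if has_term and in_class:
            a += 1      # term present, descriptor assigned
        elif has_term:
            b += 1      # term present, descriptor not assigned
        elif in_class:
            c += 1      # term absent, descriptor assigned
        else:
            d += 1      # term absent, descriptor not assigned
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def top_features(docs, labels, descriptor, k=2):
    """The k highest-scoring terms for one descriptor's binary classifier."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: (-chi_square(docs, labels, t, descriptor), t))
    return ranked[:k]

# Hypothetical corpus: token sets and their manually assigned descriptors.
docs = [{"fish", "quota", "act"}, {"fish", "vessel", "act"},
        {"tax", "income", "act"}, {"tax", "rate", "act"}]
labels = [{"fisheries"}, {"fisheries"}, {"taxation"}, {"taxation"}]
selected = top_features(docs, labels, "fisheries", k=2)
```

Note that chi-square is symmetric: "tax" ranks as high as "fish" for the descriptor "fisheries" because its presence is strong evidence against the descriptor, while "act", which occurs in every document, scores zero.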

28
Further development of eCADIS
  • training with new features and feature selection
    methods
  • collocations, word n-grams, chunks
  • new measures for evaluation of results
  • sensitive to thesaurus hierarchy
  • web-interface for eCADIS for inclusion into the
    CADIAL system
  • eCADIS for other languages
  • now only Croatian and English (β-version) covered
  • usable for other languages as it is, but less
    efficient without the linguistic module
  • no list of lemmas, only types
  • poor statistics for n-grams
  • cooperation with language technology experts in
    different languages for development of linguistic
    modules
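
The slide above asks for evaluation measures that are sensitive to the thesaurus hierarchy but does not define them. One known option, shown here purely as an illustration, is hierarchical precision and recall: both descriptor sets are extended with their ancestors before the overlap is counted, so a suggestion close to the correct descriptor earns partial credit. The parent map is a hypothetical single-parent toy hierarchy, not real Eurovoc data.

```python
def with_ancestors(labels, parent):
    """Extend a set of descriptors with all of their broader terms."""
    expanded = set()
    for label in labels:
        # Walk up the hierarchy until the top or an already visited node.
        while label is not None and label not in expanded:
            expanded.add(label)
            label = parent.get(label)
    return expanded

def hierarchical_scores(suggested, manual, parent):
    """Return (hierarchical precision, hierarchical recall)."""
    s = with_ancestors(suggested, parent)
    m = with_ancestors(manual, parent)
    overlap = len(s & m)
    h_precision = overlap / len(s) if s else 0.0
    h_recall = overlap / len(m) if m else 0.0
    return h_precision, h_recall

# Hypothetical hierarchy: child descriptor -> broader descriptor (None at the top).
parent = {"sea fishing": "fisheries", "fish farming": "fisheries",
          "fisheries": None, "taxation": None}
near_miss = hierarchical_scores({"sea fishing"}, {"fish farming"}, parent)
far_miss = hierarchical_scores({"taxation"}, {"fish farming"}, parent)
```

A flat comparison would score both suggestions as equally wrong; here the sibling descriptor "sea fishing" gets partial credit through the shared broader term, while "taxation" gets none.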

29
Further development of eCADIS
  • eCADIS for other languages
  • training the automatic indexing system for other
    languages
  • enables automatic suggestions of relevant
    descriptors in new, unseen documents
  • analysis of manual markings
  • descriptors, word n-grams, suggestions
  • promote the use of eCADIS in other countries
    beyond the scope of the CADIAL project
  • e.g. Belgium (Flanders)
  • linguistic module for Dutch and French needed
  • computational linguistics expertise
  • training data from Acquis can be used to make an
    automatic indexing system for Dutch and French
  • machine learning expertise

30
Conclusion
  • CADIAL
  • a joint Flemish-Croatian project sponsored by
    the Flemish government
  • better public access to Croatian official
    documentation
  • faster and improved document indexing
  • automatic content metadata generation (Semantic
    Web)
  • easier document retrieval and exploration of
    legislation
  • multilingual access via standardized EU thesaurus
    Eurovoc
  • a test-case for the usage of such a system in
    Flanders
  • Web: information on the CADIAL project and eCADIS
  • www.cadial.org
  • contact
  • bojana.dalbelo_at_fer.hr
  • marie-france.moens_at_law.kuleuven.ac.be
