Title: Computer Aided Document Indexing System for Accessing Legislation A Joint Venture of Flanders and Croatia Bojana Dalbelo Ba
1Computer Aided Document Indexing System for Accessing Legislation
A Joint Venture of Flanders and Croatia
- Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb, bojana.dalbelo_at_fer.hr
- Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb, marko.tadic_at_ffzg.hr
- Marie-Francine Moens, Centre for Law and IT / Dept. of Computer Science, Katholieke Universiteit Leuven
2Talk overview
- document indexing and computer aided document indexing
- project AIDE
- CADIS workstation features
- project CADIAL
- eCADIS workstation additional features
- machine learning techniques
- future developments
- conclusions
3Computer Aided Document Indexing
- document indexing
- attachment of descriptors from a controlled thesaurus to a document
- descriptors: labels representing the content of a document
- necessary for document retrieval in many document collections
- parliamentary documentation
- legislation
- technical documentation
- usually done manually
- tedious, error-prone, slow (max. 30-40 documents/day)
- could computers be of any help in this process?
- if we build a Computer Aided Document Indexing System (CADIS)
4Project AIDE in Croatia
- idea for a project
- September 2004
- interdisciplinary collaboration of 3 institutions
- Croatian Information Documentation Referral Agency (HIDRA)
- Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb
- Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb
5AIDE collaborating institutions
- HIDRA
- collecting, processing, providing public access and promotion of the official documentation of the Republic of Croatia
- coordinator Maja Cvitaš, M.A.
- ZEMRIS
- research in the field of artificial intelligence, neural networks, machine learning, data and text mining
- coordinators prof. Bojana Dalbelo Bašić and Jan Šnajder, M.Sc.
- ZZL
- computational linguistic research and building language technologies for Croatian
- coordinator prof. Marko Tadić
6AIDE project objective
- Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from Eurovoc
7AIDE how?
- AIDE: Automatic Indexing of Documents with Eurovoc
- automatic indexing, how?
- program which learns to index documents
- conference in Joint Research Center of EC (JRC), Ispra, Italy, 2004-09
- at least 10,000 manually indexed documents
- 3-5 descriptors per document
- 10-15 documents per descriptor
- indexed documents stored in XML format
- Steinberger (2003)
- compiling a corpus of Croatian manually indexed documents for machine learning of automatic indexing with Eurovoc descriptors
- situation with Croatian documentation in 2004-09
- there were only a few hundred documents indexed
- manual indexing painfully slow
- how could we speed up the manual indexing?
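The corpus figures on this slide (at least 10,000 documents, 3-5 descriptors per document, 10-15 documents per descriptor) can be read as rough targets for a training corpus. The sketch below shows one way such figures could be checked on a manually indexed collection. It is only an illustration: the function name `corpus_stats`, the `min_docs` threshold, and the toy data are assumptions, and the in-memory dictionary stands in for whatever XML format the indexed documents are actually stored in.

```python
from collections import Counter
from statistics import mean


def corpus_stats(index, min_docs=10):
    """Summarise a manually indexed corpus.

    `index` maps a document id to the descriptors attached to it and is
    assumed to be non-empty. `min_docs` is a hypothetical threshold below
    which a descriptor is reported as having too few training documents.
    """
    # Count, for every descriptor, how many documents carry it.
    docs_per_descriptor = Counter(
        d for descriptors in index.values() for d in set(descriptors)
    )
    return {
        "documents": len(index),
        "mean_descriptors_per_document": mean(
            len(set(ds)) for ds in index.values()
        ),
        "mean_documents_per_descriptor": mean(docs_per_descriptor.values()),
        "rare_descriptors": sorted(
            d for d, n in docs_per_descriptor.items() if n < min_docs
        ),
    }


# Toy, made-up index; a real corpus would be loaded from the indexed files.
toy_index = {
    "doc1": ["fisheries", "EU law", "trade"],
    "doc2": ["fisheries", "environment", "EU law"],
    "doc3": ["trade", "customs", "EU law", "taxation"],
}
stats = corpus_stats(toy_index)
```

On a corpus of only a few hundred documents, the `rare_descriptors` list would be expected to contain most of the thesaurus, which is one way to see why faster manual indexing had to come first.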
8AIDE activities
- investigate and develop algorithms in the field of computational linguistics/language technologies
- include that knowledge into the Computer Aided Document Indexing System (CADIS)
- demonstration of CADIS in European Parliament
9CADIS two parallel windows
Eurovoc browser window
Document window
10Document Window
12CADIS features
- Enhanced user interface
- list of descriptors literally appearing in document
13CADIS features
- Descriptors and non-descriptors marked in document
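Listing the descriptors that literally appear in a document, and marking descriptors and non-descriptors in its text, can be illustrated with a plain string-matching sketch. This is not the CADIS implementation: the thesaurus fragment, the function names `find_terms` and `mark`, and the bracket notation are all hypothetical. Matching is purely literal and case-insensitive, so inflected word forms are missed, which is where lemmas from a linguistic module would help.

```python
import re

# Hypothetical thesaurus fragment (not real Eurovoc data).
DESCRIPTORS = {"fishing industry", "environmental protection"}
# Non-descriptors point to the preferred descriptor they stand for.
NON_DESCRIPTORS = {"fisheries": "fishing industry"}


def find_terms(text, descriptors, non_descriptors):
    """Return sorted (start, end, surface, descriptor, kind) tuples
    for every literal, case-insensitive match in `text`."""
    terms = [(t, t, "descriptor") for t in descriptors]
    terms += [(t, target, "non-descriptor")
              for t, target in non_descriptors.items()]
    hits = []
    for surface, descriptor, kind in terms:
        pattern = r"\b" + re.escape(surface) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), m.group(0), descriptor, kind))
    return sorted(hits)


def mark(text, hits):
    """Wrap descriptors in [[...]] and non-descriptors in ((...))."""
    out, pos = [], 0
    for start, end, surface, _, kind in hits:
        if start < pos:  # skip matches overlapping an earlier one
            continue
        left, right = ("[[", "]]") if kind == "descriptor" else ("((", "))")
        out.append(text[pos:start])
        out.append(left + surface + right)
        pos = end
    out.append(text[pos:])
    return "".join(out)


text = ("Fisheries policy supports the fishing industry "
        "and environmental protection.")
hits = find_terms(text, DESCRIPTORS, NON_DESCRIPTORS)
marked = mark(text, hits)
found = sorted({h[3] for h in hits})  # descriptors to list for the indexer
```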
14CADIS features
15CADIS features
- Integration of corpus analysis
- greyed n-grams are statistically relevant in the corpus, i.e. collocations
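The slides do not say which statistic decides that an n-gram is relevant. As one common possibility, the sketch below ranks bigrams by pointwise mutual information (PMI). The choice of PMI, the function name `bigram_pmi`, the `min_count` cut-off, and the toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=2):
    """Rank bigrams by pointwise mutual information.

    `sentences` is a list of token lists. Bigrams seen fewer than
    `min_count` times are ignored, because PMI is unreliable for rare
    events.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI compares the observed pair probability with the one
        # expected if the two words occurred independently.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy, made-up corpus.
sentences = [
    "the european union adopted the regulation".split(),
    "the european union amended the directive".split(),
    "the council adopted the directive".split(),
]
ranked = bigram_pmi(sentences)
```

In this toy corpus the pair ("european", "union") ranks first, while pairs containing the very frequent word "the" score lower even though they occur just as often.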
16CADIS features
- Manual marking of significant n-grams
- important step towards further refinement of automatic indexing
17Eurovoc browser window
18AIDE activities
- investigate and develop algorithms in the field of computational linguistics/language technologies
- include that knowledge into the Computer Aided Document Indexing System (CADIS)
- demonstration of CADIS in European Parliament
- ca 10,000 Croatian documents indexed in HIDRA using CADIS workstation during 2006
- joint project proposal with Katholieke Universiteit Leuven for CADIAL project
19CADIAL project
- Computer Aided Document Indexing for Accessing Legislation
- a joint Flemish-Croatian project
- Department International Flanders, grant no. KRO/009/06
- partners
- Katholieke Universiteit Leuven (prof. Marie-Francine Moens)
- University of Zagreb, HIDRA (prof. Bojana Dalbelo Bašić)
- started 2007-03
- duration 2 years
- web www.cadial.org
- the goal: publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia
- new version of CADIS (eCADIS) is one of the modules in this project
- planned as a web-based service
20CADIAL project 2
- used the 10,000 manually indexed documents to train the system for automatic indexing of documents in Croatian
- used the 20,000 manually indexed documents from Acquis to train the system for automatic indexing of documents in English
- included that training data into the next version eCADIS (?-version)
21eCADIS (?) features
- Automatic suggestion of relevant descriptors, i.e. automatic indexing
- application of machine learning techniques
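Suggesting descriptors with machine learning can be pictured as training one independent binary classifier per descriptor and offering every descriptor whose classifier accepts the document. The sketch below uses a simple Rocchio-style (nearest-centroid) rule because it fits in a few lines of standard-library Python. It is not the eCADIS implementation, and the function names, the decision rule, and the toy corpus are assumptions.

```python
import math
from collections import Counter


def vectorize(tokens):
    """Length-normalised bag-of-words vector stored as a dict."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}


def centroid(vectors):
    """Average of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train(corpus):
    """corpus: list of (tokens, set of descriptors).
    Returns one (positive centroid, negative centroid) pair per descriptor."""
    vectors = [(vectorize(tokens), labels) for tokens, labels in corpus]
    models = {}
    for d in set().union(*(labels for _, labels in corpus)):
        pos = [v for v, labels in vectors if d in labels]
        neg = [v for v, labels in vectors if d not in labels]
        if pos and neg:
            models[d] = (centroid(pos), centroid(neg))
    return models


def suggest(models, tokens):
    """Descriptors whose positive centroid is closer than the negative one."""
    v = vectorize(tokens)
    scored = [(cosine(v, pos) - cosine(v, neg), d)
              for d, (pos, neg) in models.items()]
    return [d for score, d in sorted(scored, reverse=True) if score > 0]


# Toy, made-up training data.
corpus = [
    ("fishing quota for cod".split(), {"fisheries"}),
    ("fishing vessels catch quota".split(), {"fisheries"}),
    ("customs duty on imported goods".split(), {"customs"}),
    ("customs tariff for imported goods".split(), {"customs"}),
]
models = train(corpus)
suggestions = suggest(models, "cod fishing quota".split())
```

Each descriptor is decided independently of the others, so a document can receive several suggestions or none at all.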
22eCADIS (?) features
- Compare it to manually attached indexes
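Comparing automatic suggestions with the manually attached descriptors is commonly summarised with precision and recall. The sketch below is a generic illustration of that comparison for a single document. The function name `compare` and the example descriptors are assumptions, and the slides do not state which measures the project reports.

```python
def compare(suggested, manual):
    """Compare suggested descriptors with manually attached ones."""
    suggested, manual = set(suggested), set(manual)
    agreed = suggested & manual
    # Precision: share of suggestions the human indexer also chose.
    precision = len(agreed) / len(suggested) if suggested else 0.0
    # Recall: share of manual descriptors the system found.
    recall = len(agreed) / len(manual) if manual else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if agreed else 0.0
    return {
        "agreed": sorted(agreed),
        "only_suggested": sorted(suggested - manual),
        "only_manual": sorted(manual - suggested),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up example for one document.
result = compare(
    suggested=["fishing industry", "EU law", "taxation"],
    manual=["fishing industry", "EU law",
            "environmental protection", "trade policy"],
)
```

The `only_suggested` list corresponds to suggestions an indexer might mark as inappropriate, and `only_manual` to descriptors the system missed.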
23eCADIS (?) features
- Manual marking of inappropriate suggestions
- another step in further refinement of automatic indexing
24eCADIS (?) on document in English
25eCADIS (?) on document in English
- Automatic suggestion of relevant descriptors, i.e. automatic indexing
26eCADIS (?) on document in English
- Compare it to manually attached indexes
27Training the classifiers
- already existing classifiers
- profile classifier (Steinberger 2003)
- K-nearest neighbours
- binary classifiers
- SVM, Logistic Regression, Rocchio, Bayes,
- classifiers used for the preliminary training
- ca 3500 independent binary classifiers
- need to be further evaluated
- Logistic Regression used for 10,000 documents in Croatian
- SVM used for 20,000 documents in English
- features
- tokens, lemmas, stems, character n-grams
- various feature selection methods and their combinations: χ², IG, MI
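Of the feature selection methods named above, the chi-square statistic is the simplest to show in a few lines. For each binary classifier it scores how strongly the presence of a feature depends on the presence of the descriptor. The sketch below is a minimal illustration using the standard 2x2 contingency formula. The function names and toy documents are assumptions, and nothing here reflects how the methods were combined in the project.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/descriptor table.

    n11: documents with the term and the descriptor
    n10: documents with the term, without the descriptor
    n01: documents without the term, with the descriptor
    n00: documents with neither
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:  # term or descriptor occurs in all or no documents
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, descriptor, k):
    """Return the k terms with the highest chi-square score.

    docs: list of (set of terms, set of descriptors).
    Ties are broken alphabetically to keep the result deterministic.
    """
    vocabulary = set().union(*(terms for terms, _ in docs))
    scores = {}
    for term in vocabulary:
        n11 = n10 = n01 = n00 = 0
        for terms, labels in docs:
            has_term, has_label = term in terms, descriptor in labels
            if has_term and has_label:
                n11 += 1
            elif has_term:
                n10 += 1
            elif has_label:
                n01 += 1
            else:
                n00 += 1
        scores[term] = chi_square(n11, n10, n01, n00)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy, made-up documents already reduced to sets of terms.
docs = [
    ({"fishing", "quota", "the"}, {"fisheries"}),
    ({"fishing", "vessel", "the"}, {"fisheries"}),
    ({"customs", "duty", "the"}, {"customs"}),
    ({"customs", "tariff", "the"}, {"customs"}),
]
top = select_features(docs, "fisheries", k=2)
```

Here both "fishing" and "customs" receive the top score for the descriptor "fisheries": chi-square measures dependence in either direction, so a term that reliably signals the absence of a descriptor scores as high as one that signals its presence. The word "the", which occurs in every document, scores zero.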
28Further development of eCADIS
- training with new features and feature selection methods
- collocations, word n-grams, chunks
- new measures for evaluation of results
- sensitive to thesaurus hierarchy
- web-interface for eCADIS for inclusion into the CADIAL system
- eCADIS for other languages
- now only Croatian and English (?-version) covered
- usable for other languages as it is, but without the linguistic module less efficient
- no list of lemmas, but types
- poor statistics for n-grams
- cooperation with language technology experts in
different languages for development of linguistic
29Further development of eCADIS
- eCADIS for other languages
- training the automatic indexing system for other languages
- enables automatic suggestions of relevant descriptors in new, unseen documents
- analysis of manual markings
- descriptors, word n-grams, suggestions
- promote the use of eCADIS in other countries beyond the scope of CADIAL project
- e.g. Belgium (Flanders)
- linguistic module for Dutch and French needed
- computational linguistics expertise
- training data from Acquis can be used to make an automatic indexing system for Dutch and French
- machine learning expertise
30Conclusions
- a joint Flemish-Croatian project sponsored by Flemish government
- better public access to Croatian official documentation
- faster and improved document indexing
- automatic content metadata generation (Semantic Web)
- easier document retrieval and exploration of legislation
- multilingual access via standardized EU thesaurus Eurovoc
- a test-case for the usage of such a system in Flanders
- Web information on CADIAL project and eCADIS
- www.cadial.org
- contact
- bojana.dalbelo_at_fer.hr
- marie-france.moens_at_law.kuleuven.ac.be