Title: Computer Aided Document Indexing System (CADIS) with Eurovoc
1 Computer Aided Document Indexing System (CADIS) with Eurovoc
Bojana Dalbelo Bašic, Faculty of Electrical Engineering and Computing, University of Zagreb, bojana.dalbelo_at_fer.hr
Marko Tadic, Faculty of Humanities and Social Sciences, University of Zagreb, marko.tadic_at_ffzg.hr
2 Project AIDE
- idea for a project
- September 2004, conference at JRC, Ispra
- interdisciplinary collaboration of 3 institutions
- Croatian Information Documentation Referral Agency (HIDRA)
- Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb
- Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb
3 AIDE: collaborating institutions
- HIDRA
- collecting, processing, providing public access and promotion of the official documentation of the Republic of Croatia
- coordinator: Maja Cvitaš, M.A.
- ZEMRIS
- research in the field of artificial intelligence, neural networks, machine learning, data and text mining
- coordinators: prof. Bojana Dalbelo Bašic and Jan Šnajder
- ZZL
- computational linguistic research and building language technologies for Croatian
- coordinator: prof. Marko Tadic
4 AIDE: project objective
- Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus
5 AIDE: how?
- automatic indexing, how?
- program which learns to index
- Joint Research Center of EC (JRC), Ispra, Italy
- at least 10,000 manually indexed documents
- 3-5 descriptors per document
- 10-15 documents per descriptor
- indexed documents stored in XML format
- Steinberger (2003)
- compiling a corpus of Croatian indexed documents for machine learning of automatic indexing with Eurovoc descriptors
- situation with Croatian documentation in 2004
- only a few hundred documents had been indexed
- manual indexing painfully slow
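The corpus requirements listed on this slide (number of indexed documents, descriptors per document, documents per descriptor) can be checked mechanically once the indexed documents are available as XML. The following is a minimal sketch only: the XML layout (`corpus`, `document`, `descriptor` elements), the sample descriptors, and the function names are assumptions made for illustration, since the slides do not show the actual schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical layout; the real schema of the indexed documents is not given in the slides.
SAMPLE_XML = """
<corpus>
  <document id="d1">
    <descriptor>fisheries policy</descriptor>
    <descriptor>EU law</descriptor>
    <descriptor>trade agreement</descriptor>
  </document>
  <document id="d2">
    <descriptor>EU law</descriptor>
    <descriptor>customs tariff</descriptor>
    <descriptor>trade agreement</descriptor>
  </document>
  <document id="d3">
    <descriptor>EU law</descriptor>
    <descriptor>fisheries policy</descriptor>
    <descriptor>customs tariff</descriptor>
    <descriptor>environmental protection</descriptor>
  </document>
</corpus>
"""

def corpus_stats(xml_text):
    """Count documents, descriptors per document, and documents per descriptor."""
    root = ET.fromstring(xml_text.strip())
    per_doc = {}
    per_descriptor = Counter()
    for doc in root.iter("document"):
        # A set, so a descriptor repeated inside one document counts once.
        labels = {d.text.strip() for d in doc.iter("descriptor") if d.text}
        per_doc[doc.get("id")] = len(labels)
        per_descriptor.update(labels)
    n_docs = len(per_doc)
    avg = sum(per_doc.values()) / n_docs if n_docs else 0.0
    return {
        "documents": n_docs,
        "avg_descriptors_per_document": avg,
        "documents_per_descriptor": dict(per_descriptor),
    }

def under_represented(stats, min_docs):
    """Descriptors with fewer than min_docs indexed documents."""
    counts = stats["documents_per_descriptor"]
    return sorted(d for d, n in counts.items() if n < min_docs)

stats = corpus_stats(SAMPLE_XML)
print(stats["documents"], round(stats["avg_descriptors_per_document"], 2))
print(under_represented(stats, min_docs=2))
```

With the thresholds from the slide, `under_represented(stats, min_docs=10)` would list the descriptors that still need more manually indexed documents.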
6 AIDE: how?
- how could we speed up the manual indexing?
- plan
- to develop a workstation for computer aided document indexing
- conduct the research and development of algorithms in the field of computational linguistics/language technologies
- insert that knowledge in the workstation and turn it into Computer Aided Document Indexing System (CADIS)
7 CADIS: two windows
Eurovoc browser window
Document window
8 Document Window
10 CADIS: features
- Enhanced user interface
- list of descriptors appearing in document
11 CADIS: features
- Descriptors and non-descriptors marked in document
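Marking descriptors and non-descriptors in a document, and deriving the list of descriptors that appear in it, can be sketched as a lookup of thesaurus terms in the text. This is an illustration under stated assumptions, not the CADIS implementation: the thesaurus fragment, the function names, and the exact-phrase matching are all hypothetical. Inflected languages such as Croatian would need lemmatisation before matching, which plain string matching does not provide.

```python
import re

# Hypothetical thesaurus fragment (not taken from Eurovoc itself):
# each non-descriptor points to its preferred descriptor.
DESCRIPTORS = {"fishing industry", "customs tariff"}
NON_DESCRIPTORS = {
    "fishing sector": "fishing industry",
    "tariff schedule": "customs tariff",
}

def mark_terms(text):
    """Return sorted (start, end, matched_text, descriptor, kind) tuples."""
    terms = [(t, t, "descriptor") for t in DESCRIPTORS]
    terms += [(t, d, "non-descriptor") for t, d in NON_DESCRIPTORS.items()]
    hits = []
    for term, descriptor, kind in terms:
        # Whole-phrase, case-insensitive match of the surface form.
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), m.group(0), descriptor, kind))
    return sorted(hits)

def descriptors_in(text):
    """Descriptors found directly or via one of their non-descriptors."""
    return sorted({hit[3] for hit in mark_terms(text)})

doc = "The fishing sector asked for a revised customs tariff."
marks = mark_terms(doc)
for start, end, matched, descriptor, kind in marks:
    print(start, end, matched, "->", descriptor, f"({kind})")
print(descriptors_in(doc))
```

The character offsets are what a document window would need in order to highlight each match, while `descriptors_in` corresponds to the list of descriptors appearing in the document.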
12 CADIS: features
13 CADIS: features
- Integration of corpus analysis
- greyed n-grams are statistically relevant in the
corpus
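The slides do not say which statistic decides that an n-gram is relevant in the corpus. As a rough sketch only, the example below uses a TF-IDF-style weight (frequency in the document times the log of inverse document frequency in a reference corpus). That measure, the toy corpus, and all function names are assumptions for illustration, not the measure CADIS uses.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relevant_ngrams(doc_tokens, corpus_docs, n=2, top_k=3):
    """Rank the document's n-grams by tf * smoothed idf against the corpus."""
    df = Counter()
    for tokens in corpus_docs:
        # Document frequency: count each n-gram once per corpus document.
        df.update(set(ngrams(tokens, n)))
    n_docs = len(corpus_docs)
    tf = Counter(ngrams(doc_tokens, n))
    scores = {
        gram: count * math.log((1 + n_docs) / (1 + df[gram]))
        for gram, count in tf.items()
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Toy reference corpus and document (invented for the example).
corpus = [
    "the council of the european union".split(),
    "the council adopted the regulation".split(),
    "the fishing quota was adopted".split(),
]
doc = "the council adopted the fishing quota and the fishing quota rules".split()

for gram, score in relevant_ngrams(doc, corpus, n=2, top_k=5):
    print(f"{score:.3f}  {gram}")
```

N-grams that are common across the corpus, such as "the council" here, receive a low weight, while n-grams concentrated in the current document rank higher and would be the candidates for greying.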
14 CADIS: features
- Manual marking of significant n-grams: an important step towards automatic indexing
15 Eurovoc browser window
16 Further development
- CADIS for other languages?
- already for Croatian and English
- usable for other languages without linguistic module
- cooperation needed with respective language technology experts for development of linguistic module for other languages
- partners for EU project proposals for the next step
- AIDE
- research on machine learning and text-mining
- use that knowledge to turn the workstation into an intelligent system for Automatic Indexing of Documents with Eurovoc
- establishing the publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia
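To make the idea of a program that learns to index from manually indexed documents concrete, here is a deliberately simple sketch. It builds one term profile per descriptor from the training documents and suggests descriptors for a new document by cosine similarity. This profile-based approach, the toy training data, and the function names are assumptions chosen for brevity; the slides do not describe the learning method planned for AIDE.

```python
import math
from collections import Counter, defaultdict

def train(indexed_docs):
    """indexed_docs: list of (tokens, descriptors). Returns descriptor -> term counts."""
    profiles = defaultdict(Counter)
    for tokens, descriptors in indexed_docs:
        for descriptor in descriptors:
            profiles[descriptor].update(tokens)
    return dict(profiles)

def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def suggest(profiles, tokens, top_k=3):
    """Descriptors ranked by similarity to the document; zero scores dropped."""
    doc = Counter(tokens)
    ranked = sorted(
        ((cosine(doc, profile), descriptor) for descriptor, profile in profiles.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [descriptor for score, descriptor in ranked[:top_k] if score > 0]

# Toy manually indexed documents (invented for the example).
training = [
    ("fishing quota for the adriatic sea".split(), ["fisheries policy"]),
    ("customs tariff on imported fish".split(), ["customs tariff", "fisheries policy"]),
    ("customs duty and tariff schedule".split(), ["customs tariff"]),
]
profiles = train(training)
print(suggest(profiles, "new fishing quota".split()))
print(suggest(profiles, "tariff schedule".split()))
```

In a workstation setting, such suggestions would only be proposals for the human indexer to accept or reject; each confirmed document can then be added to the training set.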
17 http://textmining.zemris.fer.hr
18 Conclusion
- CADIS is unique in Europe
- Web info at
- HIDRA: www.hidra.hr/hidra/aide/aide.htm
- ZEMRIS: textmining.zemris.fer.hr
- for download contact: bojana.dalbelo_at_fer.hr