Computer Aided Document Indexing System (CADIS) with Eurovoc

Bruxelles, 2006-03-10. Computer Aided Document Indexing System (CADIS) with Eurovoc.

1
Computer Aided Document Indexing System (CADIS) with Eurovoc
Bojana Dalbelo Bašić
Faculty of Electrical Engineering and Computing, University of Zagreb
bojana.dalbelo_at_fer.hr
Marko Tadić
Faculty of Humanities and Social Sciences, University of Zagreb
marko.tadic_at_ffzg.hr
2
Project AIDE
  • idea for a project
  • September 2004, conference at JRC, Ispra
  • interdisciplinary collaboration of 3 institutions
  • Croatian Information Documentation Referral
    Agency (HIDRA)
  • Department of Electronics, Microelectronics,
    Computer and Intelligent Systems (ZEMRIS), Faculty
    of Electrical Engineering and Computing, University
    of Zagreb
  • Institute of Linguistics (ZZL), Faculty of
    Humanities and Social Sciences, University of
    Zagreb

3
AIDE collaborating institutions
  • HIDRA
  • collecting, processing, providing public access
    and promotion of the official documentation of
    the Republic of Croatia
  • coordinator Maja Cvitaš, M.A.
  • ZEMRIS
  • research in the field of artificial intelligence,
    neural networks, machine learning, data and text
    mining
  • coordinators prof. Bojana Dalbelo Bašić and Jan
    Šnajder
  • ZZL
  • computational linguistic research and building
    language technologies for Croatian
  • coordinator prof. Marko Tadić

4
AIDE project objective
  • Development of an intelligent system for automatic
    indexing of the official documentation of the
    Republic of Croatia with descriptors from the Eurovoc
    thesaurus

5
AIDE how?
  • automatic indexing, how?
  • program which learns to index
  • Joint Research Centre of the EC (JRC), Ispra, Italy
  • at least 10,000 manually indexed documents
  • 3-5 descriptors per document
  • 10-15 documents per descriptor
  • indexed documents stored in XML format
  • Steinberger (2003)
  • compiling a corpus of Croatian indexed documents
    for machine learning of automatic indexing with
    Eurovoc descriptors
  • situation with Croatian documentation in 2004
  • only a few hundred documents had been indexed
  • manual indexing: painfully slow
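
The corpus targets above (3-5 descriptors per document, 10-15 documents per descriptor) can be checked mechanically once the indexed documents exist as XML. The following is a minimal sketch, not the project's actual tooling: the element names `corpus`, `document` and `descriptor`, the sample data, and the helper names are all assumptions, since the slides only state that indexed documents are stored in XML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical XML layout and illustrative descriptor names: the slides
# only say that indexed documents are stored in XML, not what the schema is.
SAMPLE = """
<corpus>
  <document id="d1">
    <descriptor>fisheries policy</descriptor>
    <descriptor>customs tariff</descriptor>
  </document>
  <document id="d2">
    <descriptor>customs tariff</descriptor>
  </document>
  <document id="d3">
    <descriptor>customs tariff</descriptor>
    <descriptor>public health</descriptor>
  </document>
</corpus>
"""


def corpus_stats(xml_text):
    """Count descriptors per document and documents per descriptor."""
    root = ET.fromstring(xml_text)
    descriptors_per_doc = {}
    docs_per_descriptor = Counter()
    for doc in root.iter("document"):
        # A set, so a descriptor repeated inside one document counts once.
        labels = {d.text.strip() for d in doc.iter("descriptor") if d.text}
        descriptors_per_doc[doc.get("id")] = len(labels)
        docs_per_descriptor.update(labels)
    return descriptors_per_doc, docs_per_descriptor


def under_represented(docs_per_descriptor, min_docs=10):
    """Descriptors with fewer example documents than the target minimum."""
    return sorted(d for d, n in docs_per_descriptor.items() if n < min_docs)


per_doc, per_descriptor = corpus_stats(SAMPLE)
print(per_doc)
print(under_represented(per_descriptor, min_docs=2))
```

The default `min_docs=10` mirrors the lower bound quoted on the slide. Descriptors returned by `under_represented` are the ones for which a learning program would have too few examples.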

6
AIDE how?
  • how could we speed up the manual indexing?
  • plan
  • to develop a workstation for computer aided
    document indexing
  • conduct the research and development of
    algorithms in the field of computational
    linguistics/language technologies
  • insert that knowledge in the workstation and turn
    it into Computer Aided Document Indexing System
    (CADIS)

7
CADIS: two windows
Eurovoc browser window
Document window
8
Document Window
10
CADIS features
  • Enhanced user interface
  • list of descriptors appearing in document

11
CADIS features
  • Descriptors and non-descriptors marked in document
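
Listing and marking thesaurus terms in a document can be approximated with a dictionary lookup. The sketch below is an illustration under stated assumptions, not a description of how CADIS works: the two-entry thesaurus is invented, the `[D:...]` and `[ND:...]` markers are a hypothetical notation, and matching is exact and case-insensitive, so inflected word forms are not recognised.

```python
import re

# Hypothetical mini-thesaurus; the terms are illustrative, not taken from Eurovoc.
DESCRIPTORS = {"customs tariff", "public health"}
# A non-descriptor points to the preferred descriptor used for indexing.
NON_DESCRIPTORS = {"tariff schedule": "customs tariff"}


def find_terms(text):
    """Return (start, end, surface, kind, descriptor) tuples sorted by position."""
    vocabulary = [(t, "descriptor", t) for t in DESCRIPTORS]
    vocabulary += [(t, "non-descriptor", d) for t, d in NON_DESCRIPTORS.items()]
    hits = []
    for term, kind, descriptor in vocabulary:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), m.group(0), kind, descriptor))
    return sorted(hits)


def descriptors_in(text):
    """Preferred descriptors suggested by the terms found in the text."""
    return sorted({hit[4] for hit in find_terms(text)})


def mark(text):
    """Wrap each found term in [D:...] or [ND:...], skipping overlapping hits."""
    out, pos = [], 0
    for start, end, surface, kind, _ in find_terms(text):
        if start < pos:
            continue
        tag = "D" if kind == "descriptor" else "ND"
        out.append(text[pos:start])
        out.append(f"[{tag}:{surface}]")
        pos = end
    out.append(text[pos:])
    return "".join(out)


sample = "The new Tariff Schedule affects public health."
print(descriptors_in(sample))
print(mark(sample))
```

Mapping each non-descriptor to its preferred descriptor lets the suggestion list contain only terms that are valid for indexing, while the marked text still shows the wording actually used in the document.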

12
CADIS features
  • Lists of n-grams
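
An n-gram list of this kind can be produced by sliding a window of n words over the tokenised document and counting the results. This is a minimal sketch; the tokenisation rule (lowercased runs of word characters) and the function names are assumptions, as the slides do not say how CADIS builds its lists.

```python
import re
from collections import Counter


def ngrams(text, n):
    """All word n-grams of the text, in order of appearance."""
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_lists(text, max_n=3):
    """For n = 1..max_n, the n-grams of the text with counts, most frequent first."""
    return {n: Counter(ngrams(text, n)).most_common() for n in range(1, max_n + 1)}


sample = "official documentation of the Republic and official documentation of the state"
for n, items in ngram_lists(sample).items():
    print(n, items[:3])
```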

13
CADIS features
  • Integration of corpus analysis
  • greyed n-grams are statistically relevant in the
    corpus
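
The slides do not say which statistic decides that an n-gram is relevant in the corpus. As an illustration only, the sketch below uses TF-IDF, a common stand-in: an n-gram scores high when it is frequent in the current document but occurs in few corpus documents. The function name, the smoothing formula and the sample data are assumptions.

```python
import math
from collections import Counter


def tfidf_scores(doc_ngrams, corpus_ngram_sets):
    """Score each n-gram of a document against a corpus.

    doc_ngrams: list of n-grams from the current document (with repeats).
    corpus_ngram_sets: one set of n-grams per corpus document.
    """
    n_docs = len(corpus_ngram_sets)
    scores = {}
    for gram, freq in Counter(doc_ngrams).items():
        doc_freq = sum(1 for grams in corpus_ngram_sets if gram in grams)
        # Smoothed inverse document frequency; never negative or undefined.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        scores[gram] = freq * idf
    return scores


corpus = [
    {"of the", "customs tariff"},
    {"of the"},
    {"of the", "public health"},
]
document = ["of the", "customs tariff", "customs tariff"]
scores = tfidf_scores(document, corpus)
for gram in sorted(scores, key=scores.get, reverse=True):
    print(gram, round(scores[gram], 3))
```

In this sample "customs tariff" outranks "of the" because the latter appears in every corpus document, which is the behaviour one would want before highlighting an n-gram for the indexer.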

14
CADIS features
  • Manual marking of significant n-grams: an important
    step towards automatic indexing

15
Eurovoc browser window
16
Further development
  • CADIS for other languages?
  • already for Croatian and English
  • usable for other languages without linguistic
    module
  • cooperation needed with respective language
    technology experts for development of linguistic
    module for other languages
  • partners for EU project proposals for the next
    step
  • AIDE
  • research on machine learning and text-mining
  • use that knowledge to turn the workstation into
    an intelligent system for Automatic Indexing of
    Documents with Eurovoc
  • establishing the publicly accessible service for
    automatic indexing of the official documentation
    of the Republic of Croatia

17
http://textmining.zemris.fer.hr
18
Conclusion
  • CADIS is unique in Europe
  • Web info at
  • HIDRA: www.hidra.hr/hidra/aide/aide.htm
  • ZEMRIS: textmining.zemris.fer.hr
  • for download, contact bojana.dalbelo_at_fer.hr
