Using Human Language Technology for Automatic Annotation and Indexing of Digital Library Content

Slides: 16
Provided by: markn159
Transcript and Presenter's Notes

1
Using Human Language Technology for Automatic
Annotation and Indexing of Digital Library
Content
  • Kalina Bontcheva, Diana Maynard, Hamish
    Cunningham, Horacio Saggion
  • University of Sheffield

2
The Challenge
  • Lower the cost of annotating document collections
    with metadata and semantic information
  • New ways to access digital collections via
    indexes of events, people, etc.
  • The solution: use Human Language Technology (HLT)
    that requires little or no adaptation to the
    types of texts being processed

3
(Semi-)Automatic Annotation with Semantic
Information
  • Old Bailey 18th century English Collection

4
Indexing and Search by Semantic Content
5
Information Extraction Technology
  • Identify named entities (domain-independent)
      • Persons
      • Dates
      • Numbers
      • Organizations
  • Identify domain-specific events and terms
      • Players
      • Teams
      • Events: goal, foul, etc.
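
A minimal Python sketch of the two kinds of annotation listed above. It assumes regular expressions for the domain-independent dates and numbers and a tiny hand-made term list for the football domain. The patterns, the example terms, and the `annotate` function are illustrative assumptions, not the system described in these slides.

```python
import re

# Each rule is (annotation type, pattern). Order matters: earlier rules win
# when two matches overlap, so a date is not also reported as two numbers.
RULES = [
    # Domain-independent patterns (illustrative and far from complete).
    ("Date", re.compile(
        r"\b\d{1,2} (?:January|February|March|April|May|June|July|"
        r"August|September|October|November|December) \d{4}\b")),
    ("Number", re.compile(r"\b\d+\b")),
    # Hypothetical domain-specific terms for football reports.
    ("Event", re.compile(r"\b(?:goal|foul)\b")),
    ("Team", re.compile(r"\bArsenal\b")),
]

def annotate(text):
    """Return (start, end, type, covered_text) tuples sorted by offset."""
    found = []
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            overlaps = any(s < m.end() and m.start() < e
                           for s, e, _, _ in found)
            if not overlaps:
                found.append((m.start(), m.end(), label, m.group()))
    return sorted(found)

annotations = annotate("Arsenal scored a goal on 12 May 2003.")
for annotation in annotations:
    print(annotation)
```

Adapting this sketch to another domain only means replacing the last two rules, which mirrors the split between domain-independent and domain-specific resources.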

6
(No Transcript)
7
Question
  • Which of these tools and Human Language
    Technology (HLT) can I use in other digital
    library applications?
  • Without modification in any domain
  • With domain-specific customisations

8
Domain-Independent Named Entity Recognition
  • Specifically designed for many genres and domains
  • Works on a variety of document formats
  • Person names, dates, numbers, organisations,
    monetary expressions, etc.
  • Annotations can be exported as document markup
    (e.g. XML) for further processing and/or storage,
    or indexed in Oracle
  • Multilingual support via Unicode
  • Support for distributed documents, e.g., WWW
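
As a rough illustration of exporting annotations as document markup, the Python sketch below turns stand-off annotations (character offsets plus a type) into inline XML. The function name `to_inline_xml`, the tuple layout, and the example offsets are assumptions made for this sketch; it is not GATE's export code.

```python
from xml.sax.saxutils import escape

def to_inline_xml(text, annotations):
    """Wrap each (start, end, tag) span of `text` in an XML element.

    Assumes the spans do not overlap; text outside and inside the
    spans is XML-escaped.
    """
    pieces = []
    cursor = 0
    for start, end, tag in sorted(annotations):
        pieces.append(escape(text[cursor:start]))
        pieces.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        cursor = end
    pieces.append(escape(text[cursor:]))
    return "".join(pieces)

text = "President Bush will visit Canada in June."
spans = [(0, 14, "person"), (26, 32, "location"), (36, 40, "date")]
markup = to_inline_xml(text, spans)
print(markup)
```

Keeping annotations as offsets until export leaves the original text untouched, so the same annotations could also be written to a database instead of XML.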

9
Domain-Independent Named Entity Recognition (2)
  • Low-overhead customisation possible by
    non-computer scientists
  • Used successfully in a number of projects,
    including adaptation to new languages: Bengali,
    Bulgarian, etc.
  • Publicly available, Java-based modules at
    gate.ac.uk as part of Sheffield's General
    Architecture for Text Engineering (GATE)

10
Named Entity Annotated Example
  • <html>
  • <title>President visit</title>
  • <body>
  • <person>President Bush</person> will visit
    <location>Canada</location> in the
    <date>June</date>. <person>Bush</person> is
    expected to
  • </body>
  • </html>
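
Markup of this kind can be read back to build an index by entity type, which is the idea behind searching by semantic content. The Python sketch below uses the standard library XML parser on a shortened copy of the example; the `build_index` function and the `ENTITY_TAGS` set are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened copy of the annotated example from the slide.
doc = ("<html><title>President visit</title><body>"
       "<person>President Bush</person> will visit "
       "<location>Canada</location> in the <date>June</date>."
       "</body></html>")

# Element names treated as entity annotations (assumed for this sketch).
ENTITY_TAGS = {"person", "location", "date"}

def build_index(xml_text):
    """Map each entity type to the list of strings annotated with it."""
    index = defaultdict(list)
    for element in ET.fromstring(xml_text).iter():
        if element.tag in ENTITY_TAGS:
            index[element.tag].append(element.text)
    return dict(index)

index = build_index(doc)
print(index)
```

A real index would also record which document and position each entity came from, so that a search for a person can lead back to the passage.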

11
Correcting the Computer's Mistakes
  • Less time-consuming than full manual annotation
  • 85-90% correctness is sometimes enough
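
The slide does not say how correctness is measured. One common choice, assumed here, is to compare the system's annotations with human-corrected ones and report precision, recall, and F1. The Python sketch below does this for exact span matches; the `score` function and the sample annotations are invented for illustration.

```python
def score(system, gold):
    """Return (precision, recall, f1) for exact-match annotations.

    `system` and `gold` are iterables of hashable annotations,
    for example (start, end, type) tuples.
    """
    system, gold = set(system), set(gold)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented example: the system gets three of four annotations right.
gold = [(0, 14, "person"), (26, 32, "location"),
        (36, 40, "date"), (42, 46, "person")]
system = [(0, 14, "person"), (26, 32, "location"),
          (36, 40, "date"), (42, 46, "location")]

precision, recall, f1 = score(system, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The annotations the system gets wrong are the ones a human corrector has to fix, which is why a high score translates into less manual work.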

12
Other Human Language Technology
  • Automatic speech recognition can be used in
    combination with IE to annotate sound/video
    material; results improve with training
  • Domain-specific terms and events can be annotated
    by modifying the linguistic resources of the IE
    modules or training them on human-marked texts

13
Building and Customising HLT Modules for New
Domains/Applications
  • Facilitated by existing tools such as the
    graphical development environment provided by
    GATE
  • GATE comes with a useful starting set:
      • Tokeniser
      • Gazetteer list lookup
      • Sentence detection module
      • Part-of-speech tagging module
      • A pattern-matching engine with grammars
      • Information Retrieval support, etc.
  • Try for free from http://gate.ac.uk
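
To show how such modules fit together, here is a toy Python pipeline with three stages: a tokeniser, a gazetteer lookup, and one pattern-matching rule. It does not use GATE or its APIs; every function name, the `TITLES` list, and the single rule are assumptions chosen to keep the sketch short.

```python
import re

def tokenise(text):
    """Split text into (start, end, string) word and punctuation tokens."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

# Hypothetical gazetteer list of titles that often precede a name.
TITLES = {"President", "Dr", "Mr", "Mrs"}

def gazetteer_lookup(tokens):
    """Mark tokens that appear in the TITLES list."""
    return [(start, end, "Title")
            for start, end, word in tokens if word in TITLES]

def person_rule(tokens, lookups):
    """Pattern: a Title token followed by a capitalised token is a Person."""
    title_starts = {start for start, _, _ in lookups}
    persons = []
    for (s1, _, _), (_, e2, w2) in zip(tokens, tokens[1:]):
        if s1 in title_starts and w2[:1].isupper():
            persons.append((s1, e2, "Person"))
    return persons

def run_pipeline(text):
    """Run the stages in order and return (covered_text, type) pairs."""
    tokens = tokenise(text)
    lookups = gazetteer_lookup(tokens)
    return [(text[s:e], label)
            for s, e, label in person_rule(tokens, lookups)]

result = run_pipeline("President Bush will visit Canada.")
print(result)
```

Each stage only reads the output of the earlier ones, so a list or a rule can be swapped for a new domain without touching the rest.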

14
Why Are Digital Libraries Good for HLT?
  • Digital libraries are challenging for HLT as they
    require robustness and scalability
  • Cultural heritage DLs are particularly
    challenging as they pose new types of problems
  • Example: nouns in 18th-century English texts were
    capitalised, so the NE recogniser had to deal with
    less reliable orthographic information
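
The capitalisation problem can be made concrete with a naive heuristic that treats every capitalised word after the first as a name candidate. In the Python sketch below, both sentences are invented for illustration (they are not taken from the Old Bailey collection), and `capitalised_candidates` is a hypothetical helper, not part of any NE recogniser.

```python
import re

def capitalised_candidates(sentence):
    """Return capitalised words, skipping the sentence-initial word."""
    words = re.findall(r"[A-Za-z]+", sentence)
    return [word for word in words[1:] if word[0].isupper()]

# Invented sentences: modern spelling vs. 18th-century style capitalisation.
modern = "The prisoner stole a watch from John Smith."
historic = "The Prisoner stole a Watch from John Smith."

modern_candidates = capitalised_candidates(modern)
historic_candidates = capitalised_candidates(historic)
print(modern_candidates)
print(historic_candidates)
```

With capitalised common nouns the heuristic proposes twice as many candidates, so a recogniser for such texts has to rely more on context and word lists than on orthography.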

15
Further information
  • Demos: contact me during a coffee break
  • E-mail: kalina_at_dcs.shef.ac.uk
  • Web: http://gate.ac.uk
  • Try NE recognition online:
      • http://gate.ac.uk/annie/index.jsp