Using Human Language Technology for Automatic Annotation and Indexing of Digital Library Content

1
Using Human Language Technology for Automatic
Annotation and Indexing of Digital Library
Content

Kalina Bontcheva, Diana Maynard, Hamish
Cunningham, Horacio Saggion
University of Sheffield

2
The Challenge

Lower the cost of annotating document collections
with metadata and semantic information
New ways to access digital collections via
indexes of events, people, etc.
The solution: use Human Language Technology (HLT),
which requires little or no adaptation to the
types of texts being processed

3
(Semi-)Automatic Annotation with Semantic
Information

Old Bailey 18th century English Collection

4
Indexing and Search by Semantic Content
5
Information Extraction Technology

Identify named entities (domain independent)
Persons
Dates
Numbers
Organizations
Identify domain-specific events and terms
Players
Teams
Events: goal, foul, etc.

6
(No Transcript)
7
Question

Which of these Human Language Technology (HLT)
tools can I use in other digital library
applications?
Without modification in any domain
With domain-specific customisations

8
Domain-Independent Named Entity Recognition

Specifically designed for many genres and domains
Works on a variety of document formats
Person names, dates, numbers, organisations,
monetary expressions, etc.
Annotations can be exported as document markup
(e.g. XML) for further processing and/or storage
or indexed in Oracle
Multilingual support via Unicode
Support for distributed documents, e.g., WWW
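
As an illustration of exporting annotations as document markup, here is a minimal Python sketch that wraps stand-off annotations (character offsets plus a type) in inline XML elements. It is not GATE's export code: the function name `to_inline_xml`, the tuple layout and the sample sentence are assumptions made for this example, and overlapping annotations are deliberately not handled.

```python
from xml.sax.saxutils import escape


def to_inline_xml(text, annotations):
    """Wrap each (start, end, label) span of `text` in an XML element.

    Assumption for this sketch: spans are character offsets and do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(annotations):
        if start < cursor:
            raise ValueError("overlapping annotations are not supported in this sketch")
        pieces.append(escape(text[cursor:start]))  # plain text before the span
        pieces.append(f"<{label}>{escape(text[start:end])}</{label}>")
        cursor = end
    pieces.append(escape(text[cursor:]))  # trailing plain text
    return "".join(pieces)


# Hypothetical input: offsets were worked out by hand for this sentence.
text = "President Bush will visit Canada in June."
annotations = [(0, 14, "person"), (26, 32, "location"), (36, 40, "date")]
marked_up = to_inline_xml(text, annotations)
print(marked_up)
```

Keeping annotations as offsets until export means the same annotation set could equally be written to a database index instead of markup.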

9
Domain-Independent Named Entity Recognition (2)

Low-overhead customisation possible by
non-computer scientists
Used successfully in a number of projects,
including adaptations to new languages: Bengali,
Bulgarian, etc.
Publicly available, Java-based modules at
gate.ac.uk as part of Sheffield's General
Architecture for Text Engineering (GATE)

10
Named Entity Annotated Example

<html>
<title>President visit</title>
<body>
<person>President Bush</person> will visit
<location>Canada</location> in the
<date>June</date>. <person>Bush</person> is
expected to
</body>
</html>
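
Inline entity markup of this kind can be turned into a simple index of people, places and dates. The Python sketch below shows one possible way to do that with the standard library XML parser. The two sample documents, the `ENTITY_TAGS` set and the `index_entities` function are hypothetical, and well-formed XML input is assumed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: these are the element names used for entity annotations.
ENTITY_TAGS = {"person", "location", "date"}


def index_entities(docs):
    """Map (entity type, entity text) to the sorted ids of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, markup in docs.items():
        root = ET.fromstring(markup)
        for element in root.iter():
            if element.tag in ENTITY_TAGS:
                mention = "".join(element.itertext()).strip()
                index[(element.tag, mention)].add(doc_id)
    return {key: sorted(ids) for key, ids in index.items()}


# Hypothetical, well-formed sample documents made up for this sketch.
docs = {
    "doc1": "<doc><person>President Bush</person> will visit "
            "<location>Canada</location> in <date>June</date>.</doc>",
    "doc2": "<doc><person>Bush</person> met officials in "
            "<location>Canada</location>.</doc>",
}
entity_index = index_entities(docs)
print(entity_index[("location", "Canada")])
```

Note that "President Bush" and "Bush" end up as separate index keys here; linking mentions of the same person is a further step that this sketch does not attempt.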

11
Correcting the Computer's Mistakes

Less time-consuming than full manual annotation
85-90% correctness is sometimes enough
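
The slide does not say how correctness is measured. One common way to score automatic annotations against a human-corrected version is precision, recall and F1 over exactly matching spans, sketched below in Python. The function name, the exact-match criterion and the sample annotation sets are assumptions made for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted annotations against human-corrected (gold) ones.

    Assumption: an annotation counts as correct only if its
    (start, end, label) triple matches a gold annotation exactly.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data: the system gets two of three annotations right.
gold = {(0, 14, "person"), (26, 32, "location"), (36, 40, "date")}
predicted = {(0, 14, "person"), (26, 32, "location"), (33, 35, "date")}
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```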

12
Other Human Language Technology

Automatic speech recognition can be used in
combination with IE to annotate sound/video
material; results improve with training
Domain-specific terms and events can be annotated
by modifying the linguistic resources of the IE
modules or training them on human-marked texts

13
Building and Customising HLT Modules for New
Domains/Applications

Facilitated by existing tools such as the
graphical development environment provided by
GATE
GATE comes with a useful starting set
Tokeniser
Gazetteer list lookup
Sentence detection module
Part-of-speech tagging module
A pattern-matching engine with grammars
Information Retrieval support, etc.
Try for free from http://gate.ac.uk
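
To make the idea of chaining such modules concrete, here is a toy Python sketch of a tokeniser, a gazetteer lookup and a single pattern rule run in sequence. It does not use the GATE API or its grammar language: every name in it (`tokenise`, `gazetteer_lookup`, `apply_person_rule`, the `GAZETTEER` entries) is invented for this illustration.

```python
import re

TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Hypothetical gazetteer: a tiny word list mapping entries to a type.
GAZETTEER = {"Canada": "location", "June": "date", "President": "title"}


def tokenise(text):
    """Split text into (token, start, end) triples."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def gazetteer_lookup(tokens):
    """Attach the gazetteer type (or None) to each token."""
    return [(tok, start, end, GAZETTEER.get(tok)) for tok, start, end in tokens]


def apply_person_rule(tagged):
    """Toy pattern rule: a 'title' token followed by a capitalised token is a person.

    Location and date gazetteer hits are passed through as annotations.
    """
    annotations = []
    i = 0
    while i < len(tagged):
        tok, start, end, kind = tagged[i]
        has_next = i + 1 < len(tagged)
        if kind == "title" and has_next and tagged[i + 1][0][:1].isupper():
            annotations.append((start, tagged[i + 1][2], "person"))
            i += 2  # skip the name token that was just consumed
            continue
        if kind in ("location", "date"):
            annotations.append((start, end, kind))
        i += 1
    return annotations


text = "President Bush will visit Canada in June."
pipeline_annotations = apply_person_rule(gazetteer_lookup(tokenise(text)))
print(pipeline_annotations)
```

Adapting a pipeline like this to a new domain would mostly mean editing the word list and the rule rather than the surrounding code, which is the kind of low-overhead customisation the slides describe.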

14
Why Are Digital Libraries Good for HLT?

Digital libraries are challenging for HLT as they
require robustness and scalability
Cultural heritage DLs are particularly
challenging as they pose new types of problems
Example: nouns in 18th-century English texts were
capitalised, so the NE recogniser had to deal with
less reliable orthographic information
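
The capitalisation problem can be seen with a deliberately naive heuristic that treats every capitalised word after the first as a name candidate. In the Python sketch below, both sentences are invented for illustration (they are not taken from the Old Bailey collection), and the heuristic is not the method used by the NE recogniser described in these slides.

```python
import re


def capitalised_candidates(sentence):
    """Naive heuristic: every capitalised word after the first is a name candidate."""
    words = re.findall(r"[A-Za-z]+", sentence)
    return [word for word in words[1:] if word[0].isupper()]


# Invented sentences: the second imitates 18th-century noun capitalisation.
modern = "The prisoner took the watch from the shop of John Smith in London."
period = "The Prisoner took the Watch from the Shop of John Smith in London."

modern_candidates = capitalised_candidates(modern)
period_candidates = capitalised_candidates(period)
print(modern_candidates)
print(period_candidates)
```

On the modern spelling the heuristic returns only the three name words, while on the period spelling it also returns the common nouns, so capitalisation alone no longer separates names from ordinary vocabulary.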

15
Further information