Title: Using Human Language Technology for Automatic Annotation and Indexing of Digital Library Content
1Using Human Language Technology for Automatic
Annotation and Indexing of Digital Library
Content
- Kalina Bontcheva, Diana Maynard, Hamish Cunningham, Horacio Saggion
- University of Sheffield
2The Challenge
- Lower the cost of annotating document collections with metadata and semantic information
- New ways to access digital collections via indexes of events, people, etc.
- The solution: use Human Language Technology (HLT), which requires little or no adaptation to the types of texts being processed
3(Semi-)Automatic Annotation with Semantic
Information
- Old Bailey 18th century English Collection
4Indexing and Search by Semantic Content
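One way to picture an index by semantic content is an inverted index keyed by entity type and entity string. The sketch below is only an assumption about how such an index could be built from inline-annotated text; the tag names, the document ids, the sample sentences and the `build_index` helper are hypothetical and do not describe the system shown on the slide.

```python
import re
from collections import defaultdict

# Matches simple, non-nested inline annotations such as <person>Bush</person>.
TAG = re.compile(r"<(person|location|date)>(.*?)</\1>")


def build_index(docs):
    """Map (entity type, entity string) to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for etype, value in TAG.findall(text):
            index[(etype, value)].add(doc_id)
    return {key: sorted(ids) for key, ids in index.items()}


# Invented sample documents, for illustration only.
docs = {
    "d1": "<person>Bush</person> will visit <location>Canada</location>.",
    "d2": "<location>Canada</location> hosts a summit in <date>June</date>.",
}
index = build_index(docs)
print(index[("location", "Canada")])  # ids of documents mentioning Canada as a location
```

Keying on the pair (type, string) rather than on the string alone is what allows a search for "Canada as a location" instead of a plain keyword match.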
5Information Extraction Technology
- Identify named entities (domain independent)
- Persons
- Dates
- Numbers
- Organizations
- Identify domain-specific events and terms
- Players
- Teams
- Events: goal, foul, etc.
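The two layers on this slide can be illustrated with a minimal Python sketch. It is not GATE code: the regular expressions, the `DOMAIN_TERMS` list, the sample sentence and the `annotate` function are all hypothetical, chosen only to show how domain-independent patterns and a domain-specific term list might both produce typed annotations.

```python
import re

# Domain-independent patterns (illustrative only, far from complete).
# Order matters: dates are matched before bare numbers.
GENERIC_PATTERNS = {
    "Date": re.compile(
        r"\b\d{1,2} (?:January|February|March|April|May|June|July|"
        r"August|September|October|November|December) \d{4}\b"
    ),
    "Number": re.compile(r"\b\d+\b"),
}

# Domain-specific terms for a football domain (hypothetical list).
DOMAIN_TERMS = {"goal": "Event", "foul": "Event", "Arsenal": "Team"}


def overlaps(start, end, found):
    """True if the span [start, end) overlaps any annotation already found."""
    return any(start < e and s < end for s, e, _, _ in found)


def annotate(text):
    """Return sorted (start, end, type, string) annotations found in text."""
    found = []
    for label, pattern in GENERIC_PATTERNS.items():
        for m in pattern.finditer(text):
            # Skip e.g. the digits inside an already recognised date.
            if not overlaps(m.start(), m.end(), found):
                found.append((m.start(), m.end(), label, m.group()))
    for term, label in DOMAIN_TERMS.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            found.append((m.start(), m.end(), label, m.group()))
    return sorted(found)


text = "Arsenal scored a goal on 12 May 2003 after a foul in minute 88."
for start, end, label, string in annotate(text):
    print(label, string)
```

Only `DOMAIN_TERMS` would need to change to move this toy recogniser to another domain, which mirrors the split between domain-independent and domain-specific resources.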
7Question
- Which of these Human Language Technology (HLT) tools can I use in other digital library applications?
- Without modification in any domain
- With domain-specific customisations
8Domain-Independent Named Entity Recognition
- Specifically designed for many genres and domains
- Works on a variety of document formats
- Person names, dates, numbers, organisations, monetary expressions, etc.
- Annotations can be exported as document markup (e.g. XML) for further processing and/or storage, or indexed in Oracle
- Multilingual support via Unicode
- Support for distributed documents, e.g., WWW
9Domain-Independent Named Entity Recognition(2)
- Low-overhead customisation possible by non-computer scientists
- Used successfully in a number of projects, including adaptation to new languages (Bengali, Bulgarian, etc.)
- Publicly available, Java-based modules at gate.ac.uk as part of Sheffield's General Architecture for Text Engineering (GATE)
10Named Entity Annotated Example
- <html>
- <title>President visit</title>
- <body>
- <person>President Bush</person> will visit <location>Canada</location> in <date>June</date>. <person>Bush</person> is expected to ...
- </body>
- </html>
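Inline markup of this kind can be generated from stand-off annotations, i.e. character offsets plus a type. The following sketch assumes a hypothetical `to_inline_xml` helper and non-overlapping annotations; it is not the exporter that GATE provides, and the offsets were written by hand for this one sentence.

```python
from xml.sax.saxutils import escape


def to_inline_xml(text, annotations):
    """Wrap each (start, end, type) span of text in <type>...</type> tags.

    Assumes the spans do not overlap.
    """
    out, pos = [], 0
    for start, end, etype in sorted(annotations):
        out.append(escape(text[pos:start]))  # plain text before the span
        out.append(f"<{etype}>{escape(text[start:end])}</{etype}>")
        pos = end
    out.append(escape(text[pos:]))  # plain text after the last span
    return "".join(out)


text = "President Bush will visit Canada in June."
annotations = [(0, 14, "person"), (26, 32, "location"), (36, 40, "date")]
print(to_inline_xml(text, annotations))
```

Keeping annotations as offsets and rendering the tags only on export leaves the original text untouched, so the same annotations could also be stored or indexed separately.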
11Correcting the Computer's Mistakes
- Less time-consuming than full manual annotation
- 85-90% correctness is sometimes enough
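How correct the automatic output is can be quantified by comparing it with the human-corrected version. The slide does not say which measure it has in mind; the sketch below assumes precision, recall and F1 over exact-match (start, end, type) annotations, and both the sample annotations and the `score` function are invented for illustration.

```python
def score(automatic, corrected):
    """Return (precision, recall, f1) for exact-match annotation sets."""
    auto, gold = set(automatic), set(corrected)
    correct = len(auto & gold)
    precision = correct / len(auto) if auto else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Invented example: the system mislabels the third span as a person.
automatic = [(0, 14, "person"), (26, 32, "location"), (36, 40, "person")]
corrected = [(0, 14, "person"), (26, 32, "location"), (36, 40, "date")]
precision, recall, f1 = score(automatic, corrected)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

The annotations that differ between the two sets are exactly the ones a human had to correct, so the same comparison also gives a rough estimate of the remaining manual effort.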
12Other Human Language Technology
- Automatic speech recognition can be used in combination with IE to annotate sound/video material; results improve with training
- Domain-specific terms and events can be annotated by modifying the linguistic resources of the IE modules or training them on human-marked texts
13Building and Customising HLT Modules for New
Domains/Applications
- Facilitated by existing tools such as the graphical development environment provided by GATE
- GATE comes with a useful starting set:
- Tokeniser
- Gazetteer list lookup
- Sentence detection module
- Part-of-speech tagging module
- A pattern-matching engine with grammars
- Information Retrieval support, etc.
- Try for free from http://gate.ac.uk
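The module list reads as a pipeline in which each stage adds information that later stages can use. The sketch below imitates that architecture in plain Python. It does not use the GATE API, it leaves out part-of-speech tagging, and every name in it (`tokenise`, `split_sentences`, `gazetteer_lookup`, `title_rule`, `run_pipeline`, the tiny word lists) is a hypothetical stand-in.

```python
import re

TITLES = {"President", "Mr", "Dr"}   # hypothetical gazetteer of titles
LOCATIONS = {"Canada", "Sheffield"}  # hypothetical gazetteer of places


def tokenise(text):
    """Split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_sentences(tokens):
    """Group tokens into sentences, ending each at '.', '!' or '?'."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def gazetteer_lookup(sentence):
    """Pair each token with a gazetteer class, or None if it is not listed."""
    tagged = []
    for tok in sentence:
        label = "Title" if tok in TITLES else "Location" if tok in LOCATIONS else None
        tagged.append((tok, label))
    return tagged


def title_rule(tagged):
    """Pattern rule: a Title followed by a capitalised token is a Person."""
    entities, i = [], 0
    while i < len(tagged):
        tok, label = tagged[i]
        if label == "Title" and i + 1 < len(tagged) and tagged[i + 1][0][:1].isupper():
            entities.append(("Person", f"{tok} {tagged[i + 1][0]}"))
            i += 2
            continue
        if label == "Location":
            entities.append(("Location", tok))
        i += 1
    return entities


def run_pipeline(text):
    """Run tokeniser, sentence splitter, gazetteer and pattern rule in order."""
    entities = []
    for sentence in split_sentences(tokenise(text)):
        entities.extend(title_rule(gazetteer_lookup(sentence)))
    return entities


print(run_pipeline("President Bush will visit Canada. Dr Smith stays in Sheffield."))
```

Because each stage only consumes the output of the previous one, customising for a new domain can be as small as replacing a word list or adding one more rule function.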
14Why Are Digital Libraries Good for HLT?
- Digital libraries are challenging for HLT as they require robustness and scalability
- Cultural heritage DLs are particularly challenging as they pose new types of problems
- Example: nouns in 18th-century English texts were capitalised, so the NE recogniser had to deal with less reliable orthographic information
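A small experiment makes the capitalisation problem concrete. The sketch assumes a deliberately naive recogniser that treats every capitalised, non-sentence-initial word as a name candidate, and then restricts the candidates with a name list. Both sentences and the `KNOWN_NAMES` list are invented for illustration; they are not taken from the Old Bailey collection, and this is not how the actual NE recogniser was adapted.

```python
import re

KNOWN_NAMES = {"John", "Smith", "London"}  # hypothetical gazetteer


def capitalised_candidates(sentence):
    """Return capitalised words, skipping the sentence-initial one."""
    words = re.findall(r"[A-Za-z]+", sentence)
    return [w for w in words[1:] if w[0].isupper()]


def filtered_candidates(sentence):
    """Keep only the candidates that the gazetteer confirms."""
    return [w for w in capitalised_candidates(sentence) if w in KNOWN_NAMES]


modern = "The prisoner John Smith stole a watch and a handkerchief in London."
period = "The Prisoner John Smith stole a Watch and a Handkerchief in London."

print(capitalised_candidates(modern))
print(capitalised_candidates(period))
print(filtered_candidates(period))
```

With capitalisation alone, the period-style sentence yields twice as many candidates as the modern one, half of them common nouns; checking candidates against a name list is one possible way to compensate for the weaker orthographic cue.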
15Further information
- Demos: contact me during a coffee break
- E-mail: kalina_at_dcs.shef.ac.uk
- Web: http://gate.ac.uk
- Try NE recognition online
- http://gate.ac.uk/annie/index.jsp