Title: THE DIATHESIS NEWSPAPER DIGITIZATION SUITE
1THE DIATHESIS NEWSPAPER DIGITIZATION SUITE
- Foundation of Research and Technology
- Institute of Computer Science
- Centre for Cultural Informatics
Heraklion, Crete, Greece
Martin Doerr, Georgios Markakis, Maria Theodoridou
2About DIATHESIS
- Diathesis is a newspaper digitization suite whose
primary purpose is the digitization,
classification and dissemination of archival
newspaper material.
- It was originally used for the digitization of
the Vikelaia Municipal Library's newspaper
collection (1890-1960) at Heraklion, Crete. It
has since evolved into an independent digitization
suite.
- It has been used in other projects as well (Filekpedeytiki
Etairia, Athens, Greece; the AYGHI newspaper).
3The Problem
- Historical newspapers are one of the most
significant sources of information for researchers
due to the wealth of information they provide
regarding every aspect of everyday political,
social and intellectual life.
- Access to this type of archival material is
usually obstructed by the following factors:
- In order to protect the archival material from
potential damage, some archives prohibit
access to the largest part of their collection.
- Direct contact with the original archival
material constitutes a potential health hazard
(due to dust and fungi).
- The lack of indexes to newspapers, combined with
the vastness of information contained in them,
makes research a very time-consuming task.
- Many archives adopted digitization of newspapers
as a straightforward method to deal with the
above problems. Digitized material is easier to
preserve and much easier to distribute via the
Web.
- However, conversion of archival material into a
digital image format (e.g. JPEG, TIFF, PDF or
DJVU) does not solve the problem of rapid access
to this material.
- Digitization alone is inadequate if it does not
provide the means of accessing the
digitized material in a timely and accurate
manner (also known as the searchability issue).
4Current State of the Art Newspaper Digitization
Practices
- Currently there are three main approaches for
rendering newspaper archival material searchable:
- The Physical Features Based Approach.
- The OCR Based Full Text Indexing Approach.
- The Conceptual Classification (Ontology Based)
Approach.
5The physical features based classification
approach.
- Newspapers are classified using a basic set of
metadata regarding physical features of the
original material (number of issue, date of
publication, newspaper name, number of pages
etc).
- Advantages
- Simple to implement.
- Disadvantages
- The final user is unable to conduct full-text
searches on an article or issue level basis.
- The final outcome of the digitization effort
resembles a browsing mechanism rather than a
search mechanism.
- There is no explicitly defined conceptual
structure of the archive.
- Institutions
- Anno: Austrian newspapers online project
(http://deposit.ddb.de/online/exil/exil.htm).
- "Exilpresse digital. Deutsche Exilzeitschriften
1933-1945" project (http://deposit.ddb.de/online/exil/exil.htm).
- Denmark: Digitaliserede danske aviser 1759-1865
(http://www.statsbiblioteket.dk).
6The OCR based Full Text Indexing Approach.
- Automatic digitization approaches make use
of OCR analysis of digitized newspapers. Full
Text Indexing techniques are currently considered
to be the state of the art in the area of
newspaper digitization, mainly for the
following reasons:
- Creation of a searchable full-text index via OCR
is a much faster process compared to the manual
creation of metadata.
- Separation of searchability and readability.
- It is possible to conduct searches at a
page/issue/article level basis.
- The search is conducted via keywords in a manner
that is familiar to the average user of
contemporary Web search engines.
- Efficient content dissemination over the Web.
- Disadvantages
- Well known precision/recall issues.
- Newspaper archives are not as chaotic as the Web.
- The search of information in OCR based
information retrieval systems is conceptually
blind.
- The import process is a computationally expensive
procedure.
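The keyword search described on this slide can be illustrated with a minimal sketch, assuming a toy in-memory inverted index. The article identifiers and sample texts are hypothetical and are not taken from DIATHESIS. The sketch shows why such a search is conceptually blind: a query matches only the literal tokens that the OCR step produced.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(articles):
    """Map each token to the set of article ids whose OCR text contains it."""
    index = defaultdict(set)
    for article_id, ocr_text in articles.items():
        for token in tokenize(ocr_text):
            index[token].add(article_id)
    return index


def search(index, query):
    """Return the ids of articles that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical OCR output for three article segments.
articles = {
    "a1": "The harbour of Heraklion was enlarged this year",
    "a2": "Elections were held in Crete",
    "a3": "The harbonr works continue",  # OCR misread of 'harbour'
}
index = build_index(articles)
```

With this data, `search(index, "harbour")` finds only `a1`: the OCR error hides `a3`, and a query for "port" finds nothing although `a1` is about one.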
7The OCR based Full Text Indexing Approach.
- Institutions adopting this approach
- British Library online newspaper archive
(http://www.uk.olivesoftware.com/).
- The Brooklyn Daily Eagle online
(http://www.brooklynpubliclibrary.org/eagle/).
- Northern New York historical newspapers
(http://news.nnyln.net/).
- Utah Digital Newspapers (http://www.lib.utah.edu/digital/unews/).
- Historical newspapers in Washington
(http://www.secstate.wa.gov/history/newspapersname.aspx).
- To mention just a few.
8The conceptual classification approach.
- The conceptual classification approach overcomes
many of the above weaknesses by enabling the user
to perform a knowledge engineering task upon the
already digitized material via the use of
ontologies.
- An ontology: "the specification of one's
conceptualization of a knowledge domain".
- Advantages
- Ontologies are used to express a specific
conceptual view over the digitized material.
- The use of top level ontologies guarantees to a
certain extent the semantic interoperability
among different archives.
- The user may classify a document with concepts
that are not contained within
the document itself.
- Disadvantages
- Given the density of information in a newspaper,
production of metadata is a notoriously time-consuming
task (knowledge engineering
bottleneck).
- It is almost impossible to manually define all
the semantic relations or entities contained even
in a single article in a timely manner.
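The advantage that a document can be classified by concepts absent from its text can be sketched as follows. This is a minimal illustration, assuming a hypothetical thesaurus of broader/narrower terms and hypothetical annotations; none of the terms or identifiers come from DIATHESIS.

```python
# Hypothetical thesaurus: each term points to its broader term (None = top term).
BROADER = {
    "Politics": None,
    "Elections": "Politics",
    "Municipal elections": "Elections",
    "Economy": None,
    "Trade": "Economy",
}

# Hypothetical annotations: article id -> terms assigned by an annotator.
ANNOTATIONS = {
    "a1": {"Municipal elections"},
    "a2": {"Trade"},
    "a3": {"Elections", "Economy"},
}


def ancestors(term):
    """Yield the term itself and every broader term above it."""
    while term is not None:
        yield term
        term = BROADER[term]


def search_by_concept(concept):
    """Return articles annotated with the concept or with any narrower term."""
    return {
        article_id
        for article_id, terms in ANNOTATIONS.items()
        if any(concept in ancestors(term) for term in terms)
    }
```

A query for "Politics" returns `a1` even though that article is tagged only with "Municipal elections" and need not contain the word "politics" at all.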
9The DIATHESIS Approach: a hybrid approach
- This system attempts to implement a realistic
conceptual classification approach by combining
the best elements from the three approaches
mentioned above:
- It permits searches on a newspaper issue basis
(newspaper issue name, number, publication date)
in a similar manner to the physical features
based approach.
- It permits searches on an article level basis via
the use of full text queries in a similar manner
to the OCR based Full Text Indexing Approach.
- It permits searches on an article level basis via
the semantic relationships assigned to each
segment.
- It permits searches that combine all of the above
elements.
- The system does not attempt to create a complete
semantic structure that includes all the semantic
relationships and entities (Actors, Places)
described in the text. Instead it focuses on the
creation of a coherent semantic backbone that can be
easily enriched with semantic relations.
- DIATHESIS uses CIDOC CRM as its underlying
ontology.
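A combined query of this kind might look like the sketch below. It is an illustration under stated assumptions, not the DIATHESIS implementation: the `Segment` class, its fields, the `hybrid_search` function and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Segment:
    """One article segment of a digitized issue (all fields hypothetical)."""
    newspaper: str
    issue_number: int
    published: date
    ocr_text: str
    subjects: set = field(default_factory=set)


def hybrid_search(segments, newspaper=None, date_from=None, date_to=None,
                  keyword=None, subject=None):
    """Return the segments that match every criterion that was supplied."""
    hits = []
    for seg in segments:
        if newspaper is not None and seg.newspaper != newspaper:
            continue  # issue-level (physical features) filter
        if date_from is not None and seg.published < date_from:
            continue
        if date_to is not None and seg.published > date_to:
            continue
        if keyword is not None and keyword.lower() not in seg.ocr_text.lower():
            continue  # full-text filter on the OCR output
        if subject is not None and subject not in seg.subjects:
            continue  # semantic (annotation) filter
        hits.append(seg)
    return hits


# Hypothetical sample data.
segments = [
    Segment("Paper A", 12, date(1905, 3, 1),
            "Elections announced in Crete", {"Politics"}),
    Segment("Paper A", 13, date(1905, 3, 8),
            "Olive harvest prices rise", {"Economy"}),
    Segment("Paper B", 40, date(1921, 6, 2),
            "Elections and the economy", {"Politics", "Economy"}),
]
```

Each criterion is optional, so the same function covers an issue-only search, a keyword-only search, a subject-only search, or any combination of them.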
10Aims of DIATHESIS
- To render the digitized newspapers searchable on
a document/article level basis.
- To exploit the use of OCR technology in order to
enable full text search in a newspaper
collection.
- To combine full text search with user-defined
metadata based search on a document and article
level basis in order to enhance the overall
precision factor of the system.
- To provide visualization facilities and an
ergonomic interface for:
- The timely completion of metadata according to a
set of predefined thesauri hierarchies.
- The browsing of the digitized newspaper
collection given a set of predefined thesauri
hierarchies.
- To deal with issues of semantic interoperability
of digitized material (conformance to
international standards).
- To create a robust semantic backbone that will
allow the full implementation of the CIDOC CRM
Model.
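The aim of raising precision by combining full text search with metadata can be made concrete with a small worked example. The sets below are hypothetical toy data, chosen only to show how precision and recall are computed.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved article ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth: articles that are really about the topic sought.
relevant = {"a1", "a2", "a3"}

# Hypothetical result of a keyword-only query.
keyword_hits = {"a1", "a2", "a4", "a5"}

# Hypothetical set of articles an annotator tagged with a matching subject.
tagged_subject = {"a1", "a2", "a3", "a6"}

# Combining both criteria keeps only the articles that satisfy each of them.
combined_hits = keyword_hits & tagged_subject
```

In this toy data the keyword query alone has precision 0.5, while the combined query reaches 1.0 with unchanged recall.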
11About CIDOC
- What is the CIDOC Conceptual Reference Model?
- An Object Oriented Ontology of about 80 classes
and 130 properties for cultural and natural
history.
- CRM instances can be encoded in many forms:
RDBMS, ooDBMS, XML, RDF(S), OWL.
- Accepted as ISO-21127 in June 2005.
- The CRM
- Is not a metadata standard.
- Is meant to become our language for semantic
interoperability.
- Is a Conceptual Reference Model for analyzing
and designing cultural information systems.
- Is limited to the underlying semantics of
database schemata and document structures used in
cultural heritage and museum documentation.
- Does not define the terminology used to document
these data structures.
- Does not say what cultural institutions should
document.
- Aims to explain the logic of what they actually
do document.
13An Example Hierarchy: E70 Stuff (Thing)
14 CIDOC Example (1): Modeling an Activity
[Diagram. Entities shown: E7 Activity (Crimea Conference), E65 Creation Event, date February 1945. Property labels shown: P7 took place at, P11 participated in, P14 performed, P67 is referred to by, P81 ongoing throughout, P82 at some time within, P86 falls within, P94 has created.]
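The entity and property labels on this slide can be stored as subject-property-object statements. The sketch below is a minimal illustration of that encoding. The way the nodes are linked is an assumption, since the transcript keeps only the labels and not the arrows of the diagram. `P4 has time-span` and `E52 Time-Span` are CIDOC CRM terms that do not appear among the slide's labels and are added here to connect the activity to its date. The identifier `time-span-1` is hypothetical.

```python
# Each statement is a (subject, property, object) triple.
# ASSUMPTION: the linkage below is illustrative; the slide transcript
# preserves only the labels, not the arrows of the original diagram.
TRIPLES = [
    ("Crimea Conference", "type", "E7 Activity"),
    # P4 is a CRM property that is not among the labels on the slide.
    ("Crimea Conference", "P4 has time-span", "time-span-1"),
    ("time-span-1", "type", "E52 Time-Span"),
    ("time-span-1", "P82 at some time within", "February 1945"),
]


def follow(start, *props):
    """Follow a chain of properties from a start node; return the end nodes."""
    nodes = [start]
    for prop in props:
        nodes = [o for n in nodes for s, p, o in TRIPLES if s == n and p == prop]
    return nodes


def instances_of(class_name):
    """Return every node that is typed with the given CRM class."""
    return [s for s, p, o in TRIPLES if p == "type" and o == class_name]
```

Following `P4 has time-span` and then `P82 at some time within` from the Crimea Conference leads to February 1945. The same list of triples could equally be held in a relational table or serialized as RDF.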
15 CIDOC Example (2): Describing a composite artifact
16CIDOC-CRM DIATHESIS implementation: Issue/Segments Relationships
17CIDOC-CRM DIATHESIS implementation: Issue Physical Features
18CIDOC-CRM DIATHESIS implementation: Activity References
19Thesauri Hierarchies
20 CIDOC based newspaper annotation
[Diagram labels: CIDOC CRM Core Ontology; Integration by Factual Relations; real world nodes (KOS); Donald Johanson; Discovery of Lucy; Johanson's Expedition; Lucy; Hadar; Ethiopia; Benaki Museum; Documents in Digital Libraries.]
21The System Architecture: Software Components
[Diagram labels: Client Side; Server Side; Apache Tomcat Application Server; Newspaper Digitization Suite; Diathesis Administrator; Diathesis Annotation Mechanism; DIATHESIS Web Search; Database; SIS-TMS Thesaurus Management System.]
22The System Architecture: Workflow View
23The user interface
- FEATURES
- Fully Web Based.
- Simple to use / Easy to learn.
- Intelligent Upload / Download Mechanism.
- Workflow Control.
- Data Loss Prevention Mechanism (Temporary Local
Storage and Data Recovery).
- Flexible and Ergonomic Completion of Metadata
Fields.
- Automatic Highlighting of keywords in OCR Text
(Actors, Places).
- Use of SVG thesauri hierarchies for the timely
completion of Vocabulary Reserved Metadata fields.
24The user interface
[Diagram labels: DIATHESIS; End User Search Mechanism; Annotation Mechanism; Administrator; Search for Subjects; Search for Issues; Usage Stats; Mass Import; System Configuration.]
25Demonstration: Annotation Interface
26Demonstration: End User Search Mechanism
27Future Directions
- Enrich the metadata creation process with
Information Extraction Techniques.
- Expand the suite with complementary Deep Semantic
Annotation Capabilities (Semantic Wiki).
[Diagram labels: PHASE 1; PHASE 2; PHASE 3; Material Preprocessing Phase; Shallow Semantic Annotation: metadata production phase; Deep Semantic Annotation: full CIDOC implementation phase; Information Extraction Techniques; DIATHESIS Semantic Wiki.]
28Conclusions
- The use of OCR technology in newspaper
digitization is a prominent recent trend.
However, it is unable to deal with a plethora
of issues.
- Deep Semantic annotation via Semantic Web
technologies is a promising future trend. CIDOC
CRM provides the theoretical means to achieve
this. The problem is how to implement it.
Creation of the deep semantic relationships that
exist within the boundaries of a single newspaper
issue is a time-consuming, and therefore
expensive, task.
- The DIATHESIS digitization suite encapsulates a
digitization strategy towards the creation of a
vast semantic network of factual relationships
between CIDOC entities while effectively dealing
with the following issues:
- Digitization and storage of newspaper material.
- Rendering digitized material searchable on an
issue/article level basis via the use of
metadata, thesauri hierarchies and full text
queries.
- Creating a semantic backbone that can be used by
future implementations.
- The next step: link the DIATHESIS semantic
backbone with a Semantic Wiki.
29geomark_at_ics.forth.gr martin_at_ics.forth.gr